Loading Data Files in Python


Supported Formats

Python can load the following data file formats:

  • Compressed CSV (.csv.gz)

  • ORC (.orc)

  • Parquet (.parquet)
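Each of these formats has a matching pandas reader function. The following is a minimal sketch of that mapping, assuming pandas and pyarrow are installed as covered in the sections below:

import pandas as pd

# Approximate mapping from file extension to the pandas reader used on this page
# (pd.read_orc and pd.read_parquet rely on the pyarrow package)
readers = {
    ".csv.gz": pd.read_csv,      # compression is inferred from the file extension
    ".orc": pd.read_orc,
    ".parquet": pd.read_parquet,
}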

Download the Data Files

Download and save the data files to a suitable location. In the examples that follow, the data has been saved to C:\data.

Although you could load data files directly from the data file URLs, this is not recommended because you may quickly hit usage limits or incur additional costs. We always recommend saving the files locally or to cloud storage first using the Open Data Blend Dataset UI, Open Data Blend Dataset API, or Open Data Blend for Python.
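If you want to confirm that the downloaded files are where the examples expect them, a small check like the one below can help. This is a minimal sketch that assumes the files were saved under C:\data as described above.

from pathlib import Path

# Folder where the data files are assumed to have been saved
data_dir = Path(r"C:\data")

# List the downloaded data files so you can confirm the paths used in the examples
for data_file in sorted(data_dir.rglob("*")):
    if data_file.name.endswith((".csv.gz", ".orc", ".parquet")):
        print(data_file)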

Loading Compressed (Gzip) CSV Data Files

You can use the following steps as a guide to load compressed (Gzip) CSV data files into Python.

Install the pandas module from the Anaconda prompt.

pip install pandas

Import the pandas module.

import pandas as pd

Read the compressed CSV data file into a data frame.

df_date = pd.read_csv(r"C:\data\date.csv.gz")
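As a quick sanity check, you can preview the first few rows and the inferred column types. This is a minimal sketch, assuming the data frame was assigned to df_date as above.

# Preview the first few rows of the date data
print(df_date.head())

# Check the column data types that pandas inferred from the CSV
print(df_date.dtypes)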

Loading ORC Data Files

You can use the following steps as a guide to load ORC data files into Python.

Install the pandas module and the latest version of the pyarrow module.

pip install pandas
pip install pyarrow

Import the pandas and pyarrow modules.

import pandas as pd
import pyarrow.orc as orc

Read the ORC data file into a data frame.

df_date = pd.read_orc(r"C:\data\date\date.orc")

Read a subset of the columns from the ORC data file into a data frame.

df_mot_results_2017 = pd.read_orc(r"C:\data\anonymised_mot_test_result\anonymised_mot_test_result_2017.orc", columns=["drv_anonymised_mot_test_date_key", "drv_anonymised_mot_test_result_info_key"])
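Because pyarrow.orc has already been imported, you can also read the ORC file through pyarrow directly and then convert it to a pandas data frame. This is a minimal sketch that reuses the same file path and columns as above.

# Open the ORC file with pyarrow and read only the required columns
orc_file = orc.ORCFile(r"C:\data\anonymised_mot_test_result\anonymised_mot_test_result_2017.orc")
table = orc_file.read(columns=["drv_anonymised_mot_test_date_key", "drv_anonymised_mot_test_result_info_key"])

# Convert the Arrow table to a pandas data frame
df_mot_results_2017 = table.to_pandas()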

Loading Parquet Data Files

You can use the following steps as a guide to load Parquet data files into Python.

Install the pandas and pyarrow (Apache Arrow) modules.

pip install pandas
pip install pyarrow

Import the pandas and pyarrow modules.

import pandas as pd
import pyarrow.parquet as pq

Read the Parquet data file into a data frame.

df_date = pd.read_parquet(r"C:\data\date\date.parquet")

Read a subset of the columns from the Parquet data file into a data frame.

df_mot_results_2017 = pd.read_parquet(r"C:\data\anonymised_mot_test_result\anonymised_mot_test_result_2017.parquet", columns = ["drv_anonymised_mot_test_date_key", "drv_anonymised_mot_test_result_info_key"])
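The pyarrow.parquet module imported above can also be used directly, for example to inspect the file's schema before deciding which columns to read. This is a minimal sketch that reuses the same file path.

# Inspect the Parquet file's schema without reading any data
parquet_file = pq.ParquetFile(r"C:\data\anonymised_mot_test_result\anonymised_mot_test_result_2017.parquet")
print(parquet_file.schema_arrow)

# Read only the required columns into an Arrow table and convert it to a pandas data frame
table = pq.read_table(
    r"C:\data\anonymised_mot_test_result\anonymised_mot_test_result_2017.parquet",
    columns=["drv_anonymised_mot_test_date_key", "drv_anonymised_mot_test_result_info_key"],
)
df_mot_results_2017 = table.to_pandas()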

When working with larger data files, it is good practice to read only the required columns because this reduces read times, memory footprint, and processing time.
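To see the effect of column pruning, you can compare the in-memory size of a full read with a column-restricted read. This is a minimal sketch using the file and columns from the example above; the exact savings depend on the file.

path = r"C:\data\anonymised_mot_test_result\anonymised_mot_test_result_2017.parquet"

# Read the full file, and then only the two key columns used above
df_all = pd.read_parquet(path)
df_keys = pd.read_parquet(path, columns=["drv_anonymised_mot_test_date_key", "drv_anonymised_mot_test_result_info_key"])

# Compare the approximate memory footprint of each data frame in megabytes
print(df_all.memory_usage(deep=True).sum() / 1024 ** 2)
print(df_keys.memory_usage(deep=True).sum() / 1024 ** 2)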

Open Data Blend for Python

Open Data Blend for Python is the recommended method for ingesting Open Data Blend datasets using Python. You can use the Python package called opendatablend to quickly copy data files and the corresponding dataset metadata to your local machine or supported data lake storage.

Install the opendatablend module.

pip install opendatablend

Get the data.

import opendatablend as odb

dataset_path = 'https://packages.opendatablend.io/v1/open-data-blend-road-safety/datapackage.json'

# Specify the resource name of the data file. In this example, the 'date' data file will be requested in .parquet format.
resource_name = 'date-parquet'

# Get the data and store the output object
output = odb.get_data(dataset_path, resource_name)

# Print the file locations
print(output.data_file_name)
print(output.metadata_file_name)

Use the data.

import pandas as pd

# Read a subset of the columns into a dataframe
df_date = pd.read_parquet(output.data_file_name, columns=['drv_date_key', 'drv_date', 'drv_month_name', 'drv_month_number', 'drv_quarter_name', 'drv_quarter_number', 'drv_year'])

# Check the contents of the dataframe
df_date
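Once loaded, the data frame behaves like any other pandas data frame. For example, a simple aggregation over the columns read above might look like this (a minimal sketch, assuming df_date was created as in the previous snippet):

# Count the number of date rows per calendar year
dates_per_year = df_date.groupby('drv_year')['drv_date_key'].count()
print(dates_per_year)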

Using Python for Data Analysis

Guidance on how to analyse data in Python is beyond the scope of this documentation.

You may find the following resources helpful:

  • Pandas documentation (suitable for small to medium-sized data files)

  • Modin documentation (suitable for data of any size, especially very large data files)

  • Pandas API on Spark documentation (suitable for data of any size, especially very large data files)

There was a known issue with the pyarrow.orc module, which prevented it from working correctly. If you experience the same issue, try upgrading the module to a newer version that includes the fix.

You can learn more about Open Data Blend for Python, including how to use it to ingest data into supported data lakes, in the Open Data Blend for Python documentation.