Loading Data Files in Python

Supported Formats

Python can load the following data file formats:
    Compressed CSV (.csv.gz)
    ORC (.orc)
    Parquet (.parquet)

Download the Data Files

Download and save the data files to a suitable location. In the examples that follow, the data has been saved to C:\data.
Although you could load data files directly from the data file URLs, this is not recommended because you may quickly hit usage limits or incur additional costs. We always recommend saving the files locally first using the Open Data Blend Dataset UI or the Open Data Blend Dataset API.
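One way to follow this advice in a script is to fetch each file once and reuse the local copy on later runs. The sketch below uses only the Python standard library. The download_if_missing helper is hypothetical and is not part of the Open Data Blend Dataset UI or API; it only illustrates the "save locally first" pattern.

```python
from pathlib import Path
from urllib.request import urlretrieve


def download_if_missing(url, destination):
    """Download url to destination unless a local copy already exists.

    Hypothetical helper for illustration only.
    """
    destination = Path(destination)
    if destination.exists():
        # Reuse the saved copy so the remote file is requested only once.
        return destination
    # Create the target folder (for example C:\data) if it is not there yet.
    destination.parent.mkdir(parents=True, exist_ok=True)
    urlretrieve(url, str(destination))
    return destination
```

Call it with the URL of a data file and a local path under your data folder. Later calls with the same path return immediately without contacting the server.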

Loading Compressed (Gzip) CSV Data Files

Follow the steps below to load compressed (Gzip) CSV data files into Python.
Install the pandas module from the Anaconda prompt.
pip install pandas
Import the pandas module.
import pandas as pd
Read the compressed CSV data file into a data frame.
df_date = pd.read_csv(r"C:\data\date.csv.gz")
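pandas infers Gzip compression from the .gz extension, so no extra argument is needed, and the usecols parameter limits which columns are parsed. The sketch below writes a tiny stand-in file to a temporary folder so that it can run without the real download. The rows are made up for illustration and are not taken from the actual date file.

```python
import tempfile
from pathlib import Path

import pandas as pd

# Build a small stand-in for a downloaded .csv.gz file (illustrative rows only).
sample = pd.DataFrame(
    {
        "drv_date_key": [20170101, 20170102],
        "drv_date": ["2017-01-01", "2017-01-02"],
        "drv_year": [2017, 2017],
    }
)
csv_path = Path(tempfile.mkdtemp()) / "date.csv.gz"
sample.to_csv(csv_path, index=False)  # Gzip is inferred from the ".gz" suffix

# Read it back: compression is inferred again, and only two columns are parsed.
df_date = pd.read_csv(
    csv_path,
    usecols=["drv_date_key", "drv_date"],
    parse_dates=["drv_date"],
)
print(df_date.dtypes)
```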

Loading ORC Data Files

Follow the steps below to load ORC data files into Python.
There is currently a known issue with the pyarrow.orc module which prevents it from working correctly. Until this is resolved, we recommend using the pyorc module or using the Parquet version of the data files instead.
Install the pandas module and the latest version of the pyarrow module.
pip install pandas
pip install pyarrow
Import the pandas and pyarrow modules.
import pandas as pd
import pyarrow.orc as orc
Read the ORC data file into a data frame.
df_date = pd.read_orc(r"C:\data\date\date.orc")
Read a subset of the columns from the ORC data file into a data frame.
df_mot_results_2017 = pd.read_orc(r"C:\data\anonymised_mot_test_result\anonymised_mot_test_result_2017.orc", columns=["drv_anonymised_mot_test_date_key", "drv_anonymised_mot_test_result_info_key"])

Loading Parquet Data Files

Follow the steps below to load Parquet data files into Python.
Install the Pandas and Apache Arrow modules.
pip install pandas
pip install pyarrow
Import the Pandas and Apache Arrow modules.
import pandas as pd
import pyarrow.parquet as pq
Read the Parquet data file into a data frame.
df_date = pd.read_parquet(r"C:\data\date\date.parquet")
Read a subset of the columns from the Parquet data file into a data frame.
df_mot_results_2017 = pd.read_parquet(r"C:\data\anonymised_mot_test_result\anonymised_mot_test_result_2017.parquet", columns=["drv_anonymised_mot_test_date_key", "drv_anonymised_mot_test_result_info_key"])
When working with larger data files, it is good practice to read only the columns you need, because this reduces read times, memory footprint, and processing times.

Open Data Blend for Python

You can use our Python package called opendatablend to easily cache our data files and the corresponding dataset metadata locally.
Install the opendatablend module.
pip install opendatablend
Get the data.
import opendatablend as odb

dataset_path = 'https://packages.opendatablend.io/v1/open-data-blend-road-safety/datapackage.json'

# Specify the resource name of the data file. In this example, the 'date' data file will be requested in .parquet format.
resource_name = 'date-parquet'

# Get the data and store the output object
output = odb.get_data(dataset_path, resource_name)

# Print the file locations
print(output.data_file_name)
print(output.metadata_file_name)
Use the data.
import pandas as pd

# Read a subset of the columns into a dataframe
df_date = pd.read_parquet(output.data_file_name, columns=['drv_date_key', 'drv_date', 'drv_month_name', 'drv_month_number', 'drv_quarter_name', 'drv_quarter_number', 'drv_year'])

# Check the contents of the dataframe
df_date
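The cached metadata file can also be inspected, for example to see which resource names are available for a dataset. The sketch below assumes the file follows the usual datapackage.json layout, with a top-level "resources" list whose entries each have a "name". That layout is an assumption based on the file name, and list_resource_names is a hypothetical helper, not part of the opendatablend package.

```python
import json


def list_resource_names(metadata_file_name):
    """Return the resource names declared in a datapackage.json file.

    Hypothetical helper for illustration only.
    """
    with open(metadata_file_name, encoding="utf-8") as metadata_file:
        package = json.load(metadata_file)
    # Assumption: resources sit under "resources", each with a "name" key.
    return [resource["name"] for resource in package.get("resources", [])]
```

With the output object from the previous step, calling list_resource_names(output.metadata_file_name) would return names such as 'date-parquet', provided the metadata matches the assumed layout.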
You can learn more about Open Data Blend for Python and see other examples here.

Using Python for Data Analysis

Guidance on how to analyse data in Python is beyond the scope of this documentation.
You may find the following helpful: