# Loading Data Files in Python

## Supported Formats

Python can load the following data file formats:

* Compressed CSV (`.csv.gz`)
* ORC (`.orc`)
* Parquet (`.parquet`)

## Download the Data Files

Download and save the data files to a suitable location. In the examples that follow, the data has been saved to `C:\data`.

{% hint style="info" %}
Although you could load data files directly from the data file URLs, this is not recommended because you may quickly hit usage limits or incur additional costs. We always recommend saving the files locally or to cloud storage first using the Open Data Blend Dataset UI, Open Data Blend Dataset API, or [Open Data Blend for Python](#open-data-blend-for-python).
{% endhint %}

## Loading Compressed (Gzip) CSV Data Files

Use the following steps as a guide to load compressed (Gzip) CSV data files into Python.

Install the `pandas` module from the Anaconda prompt.

```shell
pip install pandas
```

Import the `pandas` module.

```python
import pandas as pd
```

Read the compressed CSV data file into a data frame.

```python
df_date = pd.read_csv(r"C:\data\date.csv.gz")
```
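pandas infers Gzip compression from the `.gz` extension, so no extra arguments are needed; you can also set `compression` explicitly. The following self-contained sketch writes a small compressed CSV to a temporary directory (with hypothetical column names) and reads it back to show the round trip:

```python
import os
import tempfile

import pandas as pd

# A tiny frame with hypothetical columns, just to demonstrate the round trip
df_out = pd.DataFrame({"drv_date_key": [20170101, 20170102], "drv_year": [2017, 2017]})

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "date.csv.gz")
    df_out.to_csv(path, index=False)  # compression inferred from the .gz extension

    # compression="infer" is the default, shown here for clarity
    df_in = pd.read_csv(path, compression="infer")

print(df_in.shape)  # (2, 2)
```

The same pattern applies to the real files under `C:\data`; only the path changes.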

## Loading ORC Data Files

Use the following steps as a guide to load ORC data files into Python.

{% hint style="info" %}
There was a [known issue](https://issues.apache.org/jira/browse/ARROW-7811) with the `pyarrow.orc` module, which prevented it from working correctly. If you experience the same issue, try upgrading the module to a newer version that includes the fix.
{% endhint %}

Install the `pandas` module and the latest version of the `pyarrow` module.

```shell
pip install pandas
pip install pyarrow
```

Import the `pandas` and `pyarrow` modules.

```python
import pandas as pd
import pyarrow.orc as orc
```

Read the ORC data file into a data frame.

```python
df_date = pd.read_orc(r"C:\data\date\date.orc")
```

Read a subset of the columns from the ORC data file into a data frame.

```python
df_mot_results_2017 = pd.read_orc(r"C:\data\anonymised_mot_test_result\anonymised_mot_test_result_2017.orc", columns=["drv_anonymised_mot_test_date_key", "drv_anonymised_mot_test_result_info_key"])
```

## Loading Parquet Data Files

Use the following steps as a guide to load Parquet data files into Python.

Install the `pandas` and `pyarrow` modules.

```shell
pip install pandas
pip install pyarrow
```

Import the `pandas` and `pyarrow.parquet` modules.

```python
import pandas as pd
import pyarrow.parquet as pq
```

Read the Parquet data file into a data frame.

```python
df_date = pd.read_parquet(r"C:\data\date\date.parquet")
```

Read a subset of the columns from the Parquet data file into a data frame.

```python
df_mot_results_2017 = pd.read_parquet(r"C:\data\anonymised_mot_test_result\anonymised_mot_test_result_2017.parquet", columns=["drv_anonymised_mot_test_date_key", "drv_anonymised_mot_test_result_info_key"])
```

{% hint style="info" %}
When working with larger data files, it is good practice to read only the columns you need, as this reduces read times, memory footprint, and processing time.
{% endhint %}

## Open Data Blend for Python

Open Data Blend for Python is the recommended method for ingesting Open Data Blend datasets using Python. You can use the Python package called `opendatablend` to quickly copy data files and the corresponding dataset metadata to your local machine or supported data lake storage.

Install the `opendatablend` module.

```shell
pip install opendatablend
```

Get the data.

```python
import opendatablend as odb

dataset_path = 'https://packages.opendatablend.io/v1/open-data-blend-road-safety/datapackage.json'

# Specify the resource name of the data file. In this example, the 'date' data file will be requested in .parquet format.
resource_name = 'date-parquet'

# Get the data and store the output object
output = odb.get_data(dataset_path, resource_name)

# Print the file locations
print(output.data_file_name)
print(output.metadata_file_name)
```

Use the data.

```python
import pandas as pd

# Read a subset of the columns into a dataframe
df_date = pd.read_parquet(output.data_file_name, columns=['drv_date_key', 'drv_date', 'drv_month_name', 'drv_month_number', 'drv_quarter_name', 'drv_quarter_number', 'drv_year'])

# Check the contents of the dataframe
df_date
```
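Once loaded, the data frame supports the usual pandas operations. The sketch below uses a small made-up frame with the same kind of columns as the `date` data file (the sample rows are hypothetical, not real dataset values) to show a simple aggregation:

```python
import pandas as pd

# A small frame mimicking the shape of the 'date' data file (hypothetical rows)
df_date = pd.DataFrame(
    {
        "drv_date_key": [20170101, 20170102, 20180101],
        "drv_year": [2017, 2017, 2018],
        "drv_month_name": ["January", "January", "January"],
    }
)

# Count the rows per year
rows_per_year = df_date.groupby("drv_year").size()
print(rows_per_year.to_dict())  # {2017: 2, 2018: 1}
```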

You can learn more about Open Data Blend for Python, including how to use it to ingest data into supported data lakes, [here](https://github.com/opendatablend/opendatablend-py).

## Using Python for Data Analysis

Guidance on how to analyse data in Python is beyond the scope of this documentation.

You may find the following helpful:

* [Pandas documentation](https://pandas.pydata.org/docs/) (suitable for small to medium-sized data files)
* [Modin documentation](https://modin.readthedocs.io/en/stable/) (suitable for data of any size, especially very large data files)
* [Pandas API on Spark documentation](https://spark.apache.org/docs/latest/api/python/user_guide/pandas_on_spark/index.html) (suitable for data of any size, especially very large data files)

