Loading Data Files in Python
Python can load the following data file formats:
Compressed CSV (.csv.gz)
ORC (.orc)
Parquet (.parquet)
Download and save the data files to a suitable location. In the examples that follow, the data has been saved to C:\data.
The steps below are a guide to loading compressed (Gzip) CSV data files into Python.
Install the pandas module from the Anaconda prompt.
Import the pandas module.
Read the compressed CSV data file into a data frame.
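The steps above can be sketched as follows. This is a minimal sketch rather than a definitive implementation: the function name load_compressed_csv and the file name example.csv.gz are hypothetical and introduced only for illustration. It assumes pandas is already installed, for example with `conda install pandas` from the Anaconda prompt.

```python
import pandas as pd


def load_compressed_csv(path) -> pd.DataFrame:
    """Read a Gzip-compressed CSV file into a pandas data frame."""
    # compression="gzip" is stated explicitly here; pandas can also infer
    # the compression from the .gz file extension.
    return pd.read_csv(path, compression="gzip")


# Usage (hypothetical file name; substitute a file you saved to C:\data):
# df = load_compressed_csv(r"C:\data\example.csv.gz")
# print(df.head())
```

pandas reads the whole file into memory, so this approach fits files that comfortably fit in RAM.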
The steps below are a guide to loading ORC data files into Python.
Install the pandas and pyarrow modules, using the latest version of pyarrow.
Import the pandas and pyarrow modules.
Read the ORC data file into a data frame.
Read a subset of the columns from the ORC data file into a data frame.
The steps below are a guide to loading Parquet data files into Python.
Install the pandas and pyarrow (Apache Arrow) modules.
Import the pandas and pyarrow modules.
Read the Parquet data file into a data frame.
Read a subset of the columns from the Parquet data file into a data frame.
Open Data Blend for Python is the recommended method for ingesting Open Data Blend datasets using Python. You can use the Python package called opendatablend to quickly copy data files and the corresponding dataset metadata to your local machine or supported data lake storage.
Install the opendatablend module.
Get the data.
Use the data.
Guidance on how to analyse data in Python is beyond the scope of this documentation.
You may find the following helpful:
There was a known issue with the pyarrow.orc module that prevented it from working correctly. If you experience the same issue, try upgrading the module to a newer version that includes the fix.
You can learn more about Open Data Blend for Python, including how to use it to ingest data into supported data lakes.
Compressed CSV is suitable for small to medium-sized data files. ORC and Parquet are suitable for data of any size, especially very large data files.