How to Convert CSV to Parquet Easily with Python on Linux Shell
Python has some great tools for making data more manageable, whatever you are using it for. Parquet is a columnar file format that can often shrink structured data files considerably compared with plain CSV.
I have created an easy-to-use Python script geared towards command line use.
Install some dependencies:
pip install pandas fastparquet pyarrow
Create the following file and give it a sensible name, such as convertcsvtoparquet.py:
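import os
import sys
import datetime
import pandas

# The CSV path is expected as the first command line argument.
if len(sys.argv) < 2:
    print('Error - Exiting - Usage: python3 convertcsvtoparquet.py <file.csv>')
    sys.exit(1)
txt = str(sys.argv[1])
print(f'{datetime.datetime.now()} - Info - CSV to Parquet conversion - Starting File Name {txt}')

# Split off the final extension only, so paths containing other dots still work.
base, extension = os.path.splitext(txt)
if extension != '.csv':
    print('Error - Exiting - Not a CSV file')
    sys.exit(1)

print(f'{datetime.datetime.now()} - Info - Importing CSV')
try:
    inputfile = pandas.read_csv(txt)
except Exception:
    print(f'{datetime.datetime.now()} - Error - Exiting - CSV import failed')
    sys.exit(1)

# Write the Parquet file next to the CSV, compressed with Brotli.
print(f'{datetime.datetime.now()} - Info - Writing Parquet')
outputfile = base + '.parquet'
inputfile.to_parquet(outputfile, compression='brotli')
print(f'{datetime.datetime.now()} - Complete - {outputfile} Written')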
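Deriving the output name can be fragile if it relies on splitting the file name on dots, since a path such as exports/users.2024.csv contains more than one. Here is a minimal sketch of one alternative using pathlib from the standard library. The name parquet_path_for is a hypothetical helper for illustration, not part of the script above, and the case-insensitive extension check is my own assumption.

```python
from pathlib import Path


def parquet_path_for(csv_path: str) -> Path:
    """Return the .parquet path matching a .csv path (hypothetical helper)."""
    path = Path(csv_path)
    # Assumption: accept .csv in any letter case.
    if path.suffix.lower() != ".csv":
        raise ValueError(f"Not a CSV file: {csv_path}")
    # with_suffix replaces only the final extension, so dots elsewhere
    # in the path (for example "exports/users.2024.csv") are preserved.
    return path.with_suffix(".parquet")


if __name__ == "__main__":
    for name in ("dummy.csv", "exports/users.2024.csv"):
        print(name, "->", parquet_path_for(name))
```

Raising an exception instead of exiting keeps the helper reusable; a command line script can catch the ValueError and exit with a non-zero status.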
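To try it out, download some dummy data, for example from https://www.heycsv.com/csv-sample-data, and run the script against it. Zipping the same CSV at maximum compression gives you something to compare the Parquet file size against.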
wget "https://www.dropbox.com/s/muvfojx14t8nwxl/1M-sample-users.csv?dl=0" --output-document=dummy.csv
python3 convertcsvtoparquet.py dummy.csv
zip -9 -j dummy.zip dummy.csv
ls -lha
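If you would rather see the comparison as numbers than read it off the ls output, something like the following may help. It is only a sketch: size_report is a hypothetical helper name, and it assumes dummy.csv, dummy.zip and dummy.parquet sit in the current directory after running the commands above. The first path in the list is treated as the baseline.

```python
import os


def size_report(paths):
    """Return (path, size_in_bytes, percent_of_first_file) for each path."""
    if not paths:
        return []
    baseline = os.path.getsize(paths[0])
    report = []
    for path in paths:
        size = os.path.getsize(path)
        # Guard against an empty baseline file to avoid dividing by zero.
        percent = round(100.0 * size / baseline, 1) if baseline else 0.0
        report.append((path, size, percent))
    return report


if __name__ == "__main__":
    # Assumed file names from the commands above; missing files are skipped.
    candidates = ["dummy.csv", "dummy.zip", "dummy.parquet"]
    present = [p for p in candidates if os.path.exists(p)]
    for path, size, percent in size_report(present):
        print(f"{path:15} {size:>14,} bytes  {percent:6.1f}% of {present[0]}")
```

The actual ratios depend heavily on the data, so treat any single result as an indication rather than a benchmark.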