
I am trying to read only selected columns while reading a CSV file. Suppose the CSV file has 10 columns but I want to read only 5 of them. Is there any way to do this?

In Pandas we can use usecols, but is there a similar option available in PySpark?

Pandas :

df=pd.read_csv(file_path,usecols=[1,2],index_col=0)
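For reference, usecols also accepts column names, not just positions. A minimal runnable sketch (the column names and inline CSV are made up for illustration):

```python
import io

import pandas as pd

# Illustrative CSV with five columns; we load only two of them.
csv_data = io.StringIO("id,name,age,city,score\n1,Ann,30,Oslo,9\n2,Bo,25,Rome,7\n")

# usecols restricts parsing to the listed columns:
df = pd.read_csv(csv_data, usecols=["name", "score"])
print(list(df.columns))  # ['name', 'score']
```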

Pyspark :

?
Comments:

- Does this answer your question? How to read specific column in pyspark? (blackbishop)
- But how to read them directly? (Shivika Patel)

1 Answer


Spark's CSV reader has no direct equivalent of usecols, but you can read the whole CSV file and then select the columns you want. For example, to keep the first 5 columns:

df = spark.read.csv(file_path, header=True)
df2 = df.select(df.columns[:5])