How can I read Parquet files from a remote HDFS cluster (set up on a Linux server) using Dask or PyArrow in Python? Suggestions for better approaches than these two options would also be welcome.
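For reference, the PyArrow-native route I am aware of looks roughly like the sketch below. This is untested on my side and assumes pyarrow.fs.HadoopFileSystem can reach the cluster over the native protocol, which requires libhdfs and the Hadoop client libraries (with Java and the Hadoop classpath configured) on the machine running Python. The host, port, and user values are the same redacted placeholders as in my Dask attempt further down.

import pyarrow.parquet as pq
from pyarrow import fs

# Native HDFS connection (not WebHDFS); needs libhdfs available locally.
# Host, port, and user are redacted placeholders.
hdfs = fs.HadoopFileSystem(host='10.xxx.xx.xxx', port=xxxx, user='xxxxx')

# Read the Parquet file into an Arrow table; convert to pandas if needed.
table = pq.read_table('/home/user/dir/sample.parquet', filesystem=hdfs)
df = table.to_pandas()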
Here is the code I tried:
from dask import dataframe as dd

df = dd.read_parquet(
    'webhdfs://10.xxx.xx.xxx:xxxx/home/user/dir/sample.parquet',
    engine='pyarrow',
    storage_options={'host': '10.xxx.xx.xxx', 'port': xxxx, 'user': 'xxxxx'},
)
print(df)
It fails with this error:
KeyError: "Collision between inferred and specified storage options:\n- 'host'\n- 'port'"
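From the error text, my understanding is that fsspec already infers host and port from the webhdfs://10.xxx.xx.xxx:xxxx/... URL, so passing the same keys again through storage_options triggers the collision check. Below is a minimal sketch of what I believe the corrected call should be, keeping in storage_options only what cannot be inferred from the URL (the values are still my redacted placeholders); please confirm whether this is right:

from dask import dataframe as dd

# host and port come from the URL itself; storage_options should only
# carry settings fsspec cannot infer from it, such as the HDFS user.
df = dd.read_parquet(
    'webhdfs://10.xxx.xx.xxx:xxxx/home/user/dir/sample.parquet',
    engine='pyarrow',
    storage_options={'user': 'xxxxx'},
)
print(df)

The inverse should presumably also work: a bare webhdfs:/// path with host, port, and user all given in storage_options. One detail I am unsure about: WebHDFS goes through the NameNode's HTTP port (and requires dfs.webhdfs.enabled on the cluster), which is usually not the same as the RPC port the native hdfs:// client uses, so the right port here may differ from the one used elsewhere. Is this the correct fix, or is there a better way than Dask or PyArrow for this?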