5 votes

I'm experimenting with the spark-csv package (https://github.com/databricks/spark-csv) for reading CSV files into Spark DataFrames.

Everything works, but all columns are assumed to be of StringType.

As shown in the Spark SQL documentation (https://spark.apache.org/docs/latest/sql-programming-guide.html), the schema, including data types, can be inferred automatically for built-in sources such as JSON.

Can the types of columns in a CSV file be inferred automatically?
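For reference, this is roughly how I am reading the file (a sketch only; the path and column names are placeholders, and sc is the SparkContext from spark-shell):

    import org.apache.spark.sql.SQLContext

    val sqlContext = new SQLContext(sc)

    // Read a CSV with spark-csv; every column comes back as StringType.
    val df = sqlContext.read
      .format("com.databricks.spark.csv")
      .option("header", "true") // use the first line as column names
      .load("path/to/file.csv")

    df.printSchema() // all fields are reported as string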

1. StringType is a field type in Spark SQL. 2. What you are asking is not very clear; can you be more specific about what you are trying to achieve? – eliasah
I'm asking about automatic type inference, which is available for built-in data sources such as JSON. That is, if one creates a DataFrame using sqlContext.jsonFile("...") from a JSON file with, say, one integer field and one string field, those types are reflected in the schema. Is this possible with the CSV data source format? – Oleg Shirokikh

2 Answers

7 votes

Starting with Spark 2 you can use the inferSchema option of the built-in CSV source, like this: getSparkSession().read().option("inferSchema", "true").csv("YOUR_CSV_PATH")
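For example, a minimal Scala sketch of the same thing (the app name and path are placeholders):

    import org.apache.spark.sql.SparkSession

    val spark = SparkSession.builder()
      .appName("csv-infer-schema") // placeholder app name
      .getOrCreate()

    // With inferSchema enabled, Spark samples the data and assigns
    // numeric/boolean/timestamp types instead of defaulting to string.
    val df = spark.read
      .option("header", "true")
      .option("inferSchema", "true")
      .csv("YOUR_CSV_PATH")

    df.printSchema() // columns now show the inferred types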

3 votes

Unfortunately this is not currently supported, but it would be a very useful feature. For now, the types must be declared explicitly, e.g. in DDL (see the sketch after the quoted documentation for one way to declare them). From the documentation we have:

header: when set to true the first line of files will be used to name columns and will not be included in data. All types will be assumed string. Default value is false.

which is what you are seeing.
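As a sketch of declaring the types yourself, here is the programmatic StructType route with spark-csv (column names, types, and the path are placeholders, and sqlContext is assumed to exist already):

    import org.apache.spark.sql.types._

    // Placeholder schema; replace the fields with your actual columns.
    val customSchema = StructType(Seq(
      StructField("year", IntegerType, nullable = true),
      StructField("make", StringType, nullable = true),
      StructField("price", DoubleType, nullable = true)))

    val df = sqlContext.read
      .format("com.databricks.spark.csv")
      .option("header", "true")
      .schema(customSchema) // columns are no longer all StringType
      .load("path/to/file.csv")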

Note that it is possible to infer schema at query time, e.g.

select sum(mystringfield) from mytable
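A sketch of what that looks like end to end, assuming the CSV was already loaded into df with all-string columns and mystringfield actually contains numbers:

    // Register the all-string DataFrame as a temporary table.
    df.registerTempTable("mytable")

    // Spark SQL implicitly casts the string column to a numeric type
    // in order to evaluate the aggregate.
    sqlContext.sql("select sum(mystringfield) from mytable").show()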