
I'm attempting to use PySpark to create an external table.

My code is similar to the following:

query="""
CREATE EXTERNAL TABLE IF NOT EXISTS myschema.mytable
(
col1  STRING,
col2  STRING,
col3  STRING,
col4  STRING,
... 
... a further 600+ columns here
...
)
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.OpenCSVSerde'
WITH SERDEPROPERTIES
(
   'separatorChar' = ',',
   'quoteChar' = '"',
   'escapeChar' = '\\'
)  
LOCATION 's3://mybucket/myfolder/'
"""
spark.sql(query)

When I run the above code I get the following error message

u'org.apache.hadoop.hive.ql.metadata.HiveException: java.lang.StringIndexOutOfBoundsException: String index out of range: 0;'
Traceback (most recent call last):
  File "/usr/lib/spark/python/lib/pyspark.zip/pyspark/sql/session.py", line 767, in sql
    return DataFrame(self._jsparkSession.sql(sqlQuery), self._wrapped)
  File "/usr/lib/spark/python/lib/py4j-0.10.7-src.zip/py4j/java_gateway.py", line 1257, in __call__
    answer, self.gateway_client, self.target_id, self.name)
  File "/usr/lib/spark/python/lib/pyspark.zip/pyspark/sql/utils.py", line 69, in deco
    raise AnalysisException(s.split(': ', 1)[1], stackTrace)
AnalysisException: u'org.apache.hadoop.hive.ql.metadata.HiveException: java.lang.StringIndexOutOfBoundsException: String index out of range: 0;'

However, when I take the exact same CREATE EXTERNAL TABLE statement and run it manually in the Athena query editor, it works just fine. I suspect the issue is with the quotes/double quotes in the SERDEPROPERTIES section, but I have tried numerous combinations of them to no avail. If anyone has a working Spark query that uses a similar OpenCSVSerde definition in a CREATE EXTERNAL TABLE statement, I would appreciate their input. The Spark version I'm using is 2.4.3, with Python 3.

I should have said that my input file is a CSV with a mixture of text and numeric fields, with the text fields enclosed in double quote characters.
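For reference, because the query is built as a regular (non-raw) Python triple-quoted string, Python collapses the '\\' in escapeChar to a single backslash before Spark or Hive ever parse the DDL. A small, purely illustrative sketch of how one might inspect exactly what gets passed to spark.sql (the variable name is made up; this is not the eventual fix):

# Sketch only: check what Python hands to spark.sql. In a non-raw
# triple-quoted string, '\\' becomes a single backslash.
serde_snippet = """
   'separatorChar' = ',',
   'quoteChar' = '"',
   'escapeChar' = '\\'
"""
print(serde_snippet)        # the literal text Spark receives (single backslash)
print(repr(serde_snippet))  # Python's escaped view of the same string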


1 Answer


Looking more closely at the docs for the OpenCSVSerde, it seems that the double quote is the default quote character; in other words, if you don't specify it explicitly, it is assumed. I figured the same would be true for the separator character (i.e. it would default to a comma), so in the end I just removed the whole

WITH SERDEPROPERTIES
(
   'separatorChar' = ',',
   'quoteChar' = '"',
   'escapeChar' = '\\'
)  

section from my query, and the Spark query worked a treat.
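For reference, a minimal sketch of the trimmed-down statement that relies on those defaults (the schema, table, and bucket names are just the placeholders from the question):

# Sketch of the trimmed statement, relying on the OpenCSVSerde defaults
# (comma separator, double-quote quote character).
query = """
CREATE EXTERNAL TABLE IF NOT EXISTS myschema.mytable
(
col1  STRING,
col2  STRING,
col3  STRING
-- ... remaining 600+ columns ...
)
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.OpenCSVSerde'
LOCATION 's3://mybucket/myfolder/'
"""
spark.sql(query)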