
I am trying to read a table on a Postgres DB and load the data into a Hive table on HDFS, as below:

val yearDF = spark.read.format("jdbc")
  .option("url", connectionUrl)
  .option("dbtable", s"(${execQuery}) as year2017")
  .option("user", devUserName)
  .option("password", devPassword)
  .option("partitionColumn", "source_system_name")
  .option("lowerBound", 10000)
  .option("upperBound", 50000)
  .option("numPartitions", 12)
  .load()

My job is failing with the message: Container exited with a non-zero exit code 143, along with a GC overhead limit exceeded error.

The data is huge and causes skew across partitions. After observing the data in the table, I understood that source_system_name is not the right column to partition the data on. But I found that a combination of three other columns, source_system_name, org_code, and period, is best suited to split the data while reading and processing. (I did a group by to find the row count per group, and the data is spread fairly evenly across the groups.) But I don't know how I can use all three columns in spark-jdbc's partitionColumn option. Could anyone let me know if it is possible to use multiple columns in the partitionColumn option while reading data from an RDBMS table?
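One common workaround when a single numeric column does not fit is to bypass the partitionColumn option entirely and use the spark.read.jdbc overload that takes an explicit array of WHERE-clause predicates, one per partition. A rough sketch, assuming the connection values from the question (connectionUrl, devUserName, devPassword, execQuery, and a live SparkSession named spark); the group values below are placeholders, and in practice you would obtain them with a GROUP BY on the source table:

```scala
import java.util.Properties

// Hypothetical (source_system_name, org_code, period) combinations;
// in practice, derive these from a GROUP BY on the source table.
val groups = Seq(
  ("SYS_A", "ORG1", "2017Q1"),
  ("SYS_A", "ORG1", "2017Q2"),
  ("SYS_B", "ORG2", "2017Q1")
)

// One SQL predicate per element; Spark issues one query per predicate,
// so each combination becomes its own partition.
val predicates: Array[String] = groups.map { case (sys, org, period) =>
  s"source_system_name = '$sys' AND org_code = '$org' AND period = '$period'"
}.toArray

val props = new Properties()
props.setProperty("user", devUserName)
props.setProperty("password", devPassword)

// Each predicate becomes the WHERE clause of one partition's query.
val yearDF = spark.read.jdbc(connectionUrl, s"(${execQuery}) as year2017", predicates, props)
```

This keeps one JDBC query per group, so partition balance is only as good as the group sizes you choose.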

Is this a one off load or something you will need to repeat many times? - Terry Dactyl
I have to run it adhoc. Whenever required but I see this will be used more if the implementation is successful. - Metadata
Can you alter your source table to provide a unique incrementing integer column? If so job done. - Terry Dactyl
There is a column, forecast_id, which is a unique integer column. But there are millions of rows in the table, so how can I define forecast_id as the partitionColumn and decide the lowerBound and upperBound from it? - Metadata
I don't think uniqueness is enough to eliminate the possibility of skew. You need to make sure the relevant values are fairly evenly distributed. If you use an increasing integer field, you know the first value is zero, so a simple select statement is all you need to determine the upper bound. - Terry Dactyl
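Following the comment thread, if forecast_id is used as the partition column, the bounds can be read from the table itself rather than guessed. A sketch, assuming the connection values from the question; the table name schema.year2017 is a placeholder for the actual source table:

```scala
// Read just the min and max of forecast_id through the same JDBC connection.
// "schema.year2017" is a hypothetical table name standing in for the real one.
val boundsDF = spark.read.format("jdbc")
  .option("url", connectionUrl)
  .option("dbtable", "(select min(forecast_id) as lo, max(forecast_id) as hi from schema.year2017) as b")
  .option("user", devUserName)
  .option("password", devPassword)
  .load()

// getLong assumes the driver maps the column to a SQL bigint; adjust if not.
val row = boundsDF.head()
val (lo, hi) = (row.getLong(0), row.getLong(1))

// Use the observed bounds so the 12 partitions cover similar id ranges.
val yearDF = spark.read.format("jdbc")
  .option("url", connectionUrl)
  .option("dbtable", s"(${execQuery}) as year2017")
  .option("user", devUserName)
  .option("password", devPassword)
  .option("partitionColumn", "forecast_id")
  .option("lowerBound", lo)
  .option("upperBound", hi)
  .option("numPartitions", 12)
  .load()
```

Even partitions then depend on forecast_id being roughly uniformly distributed between lo and hi, which is the commenter's caveat about skew.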

1 Answer


No, it is not possible. The following applies:

  • For partitionColumn, lowerBound, upperBound - these options must all be specified if any of them is specified. In addition, numPartitions must be specified.

  • They describe how to partition the table when reading in parallel from multiple workers. partitionColumn must be a numeric column from the table in question. The latter implies it can only be one column.

  • lowerBound and upperBound are just used to decide the partition stride, not to filter the rows in the table. So all rows in the table will be partitioned and returned.
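The stride behaviour can be illustrated with the WHERE clauses the JDBC source generates: the range between lowerBound and upperBound is split into numPartitions strides, and the first and last partitions are left open-ended so no rows are filtered out. A simplified sketch of that clause generation (it mirrors the idea, not Spark's actual code):

```scala
// Simplified model of how the JDBC source turns bounds into per-partition
// WHERE clauses; the first and last clauses are unbounded on one side,
// so rows outside [lower, upper) are still returned.
def partitionClauses(column: String, lower: Long, upper: Long, numPartitions: Int): Seq[String] = {
  val stride = (upper - lower) / numPartitions
  (0 until numPartitions).map { i =>
    val lo = lower + i * stride
    val hi = lo + stride
    if (i == 0) s"$column < $hi OR $column IS NULL"
    else if (i == numPartitions - 1) s"$column >= $lo"
    else s"$column >= $lo AND $column < $hi"
  }
}

// With the question's settings: stride = (50000 - 10000) / 12 = 3333.
val clauses = partitionClauses("source_system_name", 10000L, 50000L, 12)
```

This also shows why a non-numeric or skewed column makes a poor partitionColumn: the strides are purely arithmetic and ignore how the values are actually distributed.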