
Using Spark 1.6.1. I have a number of tables in a MariaDB database that I want to convert to PySpark DataFrame objects, but createExternalTable() throws an exception. For example:

In [292]: tn = sql.tableNames()[10]

In [293]: df = sql.createExternalTable(tn)


/home/charles/spark-1.6.1/python/lib/py4j-0.9-src.zip/py4j/protocol.py in get_return_value(answer, gateway_client, target_id, name)


306                 raise Py4JJavaError(
307                     "An error occurred while calling {0}{1}{2}.\n".
--> 308                     format(target_id, ".", name), value)
309             else:
310                 raise Py4JError(

Py4JJavaError: An error occurred while calling o18.createExternalTable.
: java.lang.RuntimeException: Tables created with SQLContext must be TEMPORARY. Use a HiveContext instead.
at scala.sys.package$.error(package.scala:27)
at org.apache.spark.sql.execution.SparkStrategies$DDLStrategy$.apply(SparkStrategies.scala:379)
at org.apache.spark.sql.catalyst.planning.QueryPlanner$$anonfun$1.apply(QueryPlanner.scala:58)
at org.apache.spark.sql.catalyst.planning.QueryPlanner$$anonfun$1.apply(QueryPlanner.scala:58)
at scala.collection.Iterator$$anon$13.hasNext(Iterator.scala:371)
at org.apache.spark.sql.catalyst.planning.QueryPlanner.plan(QueryPlanner.scala:59)
at org.apache.spark.sql.execution.QueryExecution.sparkPlan$lzycompute(QueryExecution.scala:47)
at org.apache.spark.sql.execution.QueryExecution.sparkPlan(QueryExecution.scala:45)
at org.apache.spark.sql.execution.QueryExecution.executedPlan$lzycompute(QueryExecution.scala:52)
at org.apache.spark.sql.execution.QueryExecution.executedPlan(QueryExecution.scala:52)
at org.apache.spark.sql.execution.QueryExecution.toRdd$lzycompute(QueryExecution.scala:55)
at org.apache.spark.sql.execution.QueryExecution.toRdd(QueryExecution.scala:55)
at org.apache.spark.sql.SQLContext.createExternalTable(SQLContext.scala:695)
at org.apache.spark.sql.SQLContext.createExternalTable(SQLContext.scala:668)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:606)
at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:231)
at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:381)
at py4j.Gateway.invoke(Gateway.java:259)
at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:133)
at py4j.commands.CallCommand.execute(CallCommand.java:79)
at py4j.GatewayConnection.run(GatewayConnection.java:209)
at java.lang.Thread.run(Thread.java:745)

The same thing happens if I specify source='jdbc'.

The table exists:

In [297]: sql.sql("SELECT * from {} LIMIT 5".format(tn)).show()
+--------------------+-----+-----+-----+----+------+------+------+----------------------+----+----+----+-----+
|                Date| Open| High|  Low|Last|Change|Settle|Volume|Prev_Day_Open_Interest|prod|exch|year|month|
+--------------------+-----+-----+-----+----+------+------+------+----------------------+----+----+----+-----+
|1999-10-29 00:00:...|245.0|245.0|245.0|null|  null| 245.0|   1.0|                   1.0|   C| CME|2001|    H|
|1999-11-01 00:00:...|245.0|245.0|245.0|null|  null| 245.0|   0.0|                   1.0|   C| CME|2001|    H|
|1999-11-02 00:00:...|245.0|245.0|245.0|null|  null| 245.0|   0.0|                   1.0|   C| CME|2001|    H|
|1999-11-03 00:00:...|245.0|245.5|245.0|null|  null| 245.5|   5.0|                   6.0|   C| CME|2001|    H|
|1999-11-04 00:00:...|245.5|245.5|245.5|null|  null| 245.5|   0.0|                   6.0|   C| CME|2001|    H|
+--------------------+-----+-----+-----+----+------+------+------+----------------------+----+----+----+-----+

According to the error, this should work for Hive data. I'm not using a HiveContext, but a SQLContext. According to https://spark.apache.org/docs/latest/api/python/pyspark.sql.html this is supported for versions >= 1.3.

Is there a way to extract a DataFrame from a SQL table?

Could you explain what exactly you are trying to achieve? It looks like you already have a registered table with that name. Where are the other required options? The way you use createExternalTable wouldn't be valid even if it weren't for the error you see. - zero323
A library registered the tables for me. Now that they are registered, I wish to manipulate them as PySpark DataFrames directly. - Charles Pehlivanian

1 Answer


Given the description of what you want, the method you need here is not createExternalTable, which is used to manage Hive tables, but simply table:

df = sqlContext.table(tn)

or you can assign the result of an sql call:

df = sqlContext.sql("SELECT * from {}".format(tn))
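If you ever need to bypass the registered tables and pull a table straight from MariaDB yourself, the generic JDBC data source (sqlContext.read.format("jdbc")) is the usual route in Spark 1.x. Below is a minimal sketch; the URL, table name, credentials, and driver class are hypothetical placeholders you would replace with your own connection details, and the actual load() call is shown commented out since it needs a live SQLContext and database:

```python
# Hypothetical MariaDB connection settings -- replace with your own.
jdbc_options = {
    "url": "jdbc:mysql://localhost:3306/mydb",  # placeholder JDBC URL
    "dbtable": "my_table",                      # placeholder table name
    "user": "spark",                            # placeholder credentials
    "password": "secret",
    "driver": "org.mariadb.jdbc.Driver",        # MariaDB JDBC driver class
}

# With a live SQLContext (and the driver jar on the classpath), this
# loads the table directly as a DataFrame:
# df = sqlContext.read.format("jdbc").options(**jdbc_options).load()
# df.show(5)
```

Either way, once you have the DataFrame you can use the normal DataFrame API (filter, select, groupBy, ...) instead of going through createExternalTable.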