I am using PySpark to load a CSV into Redshift. I want to query how many rows got added.
I create a new column using the withColumn function:
csvdata = df.withColumn("file_uploaded", lit("test"))
I see that this column gets created, and I can query it using psql. But when I try to query it using the PySpark SQL context, I get an error:
py4j.protocol.Py4JJavaError: An error occurred while calling o77.showString.
: java.sql.SQLException: [Amazon](500310) Invalid operation: column "test" does not exist in billingreports;
Interestingly, I am able to query the other columns, but not the new column I just added.
Appreciate any pointers on how to resolve this issue.
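For reference, here is a minimal, self-contained sketch of the pattern I am using, with hypothetical inline data in place of my real CSV:

from pyspark.sql.functions import lit

# Tiny stand-in dataframe instead of the real CSV load.
sample = spark.createDataFrame([(1,), (2,)], ["id"])
# Add a constant string column, same as in my real code.
tagged = sample.withColumn("file_uploaded", lit("test"))
tagged.printSchema()  # the new column shows up as a string
tagged.show()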
Complete code:
from pyspark.sql.functions import lit

df = spark.read.option("header", "true").csv('/mnt/spark/redshift/umcompress/' + filename)
csvdata = df.withColumn("file_uploaded", lit("test"))
countorig = csvdata.count()
# This executes without error
csvdata.write \
    .format("com.databricks.spark.redshift") \
    .option("url", jdbc_url) \
    .option("dbtable", dbname) \
    .option("tempformat", "CSV") \
    .option("tempdir", "s3://" + s3_bucket + "/temp") \
    .option("aws_iam_role", iam_role) \
    .mode("append") \
    .save()
select="select count(*) from " + dbname + " where fileuploaded='test'"
# Error occurs
df = spark.read \
    .format("com.databricks.spark.redshift") \
    .option("url", jdbc_url) \
    .option("query", select) \
    .option("tempdir", "s3://" + s3_bucket + "/test") \
    .option("aws_iam_role", iam_role) \
    .load()
newcount = df.count()
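In case it helps anyone spot the problem, here is a sketch of an alternative way to get the same count that avoids hand-building the SQL string: load the table through the connector with the dbtable option and filter on the Spark side. It reuses the same jdbc_url, dbname, s3_bucket, and iam_role variables as above.

from pyspark.sql.functions import col

# Load the whole table via the Redshift connector, then filter in Spark.
# This sidesteps any quoting issues in a hand-built query string.
table_df = spark.read \
    .format("com.databricks.spark.redshift") \
    .option("url", jdbc_url) \
    .option("dbtable", dbname) \
    .option("tempdir", "s3://" + s3_bucket + "/test") \
    .option("aws_iam_role", iam_role) \
    .load()

newcount_alt = table_df.filter(col("file_uploaded") == "test").count()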
Thanks for responding. The dataframe does have the new column, called file_uploaded. Here is the query: select="select count(*) from billingreports where file_uploaded='test'"
I have printed the schema:
|-- file_uploaded: string (nullable = true)
df.show() confirms that the new column has been added. I just want to set a predetermined string as the value of this column.