0
votes

I am new with Pyspark in Databricks so thats why im struggeling with the following: I have a dataframe that has columns with datatype string. One example (see image below) is the column netto_resultaat.

The requirement is to change this column and other financial related columns from a string to a decimal. This because i want to export the schema and data in a Azure SQL database.

i tried the following:

from pyspark.sql.functions import col from pyspark.sql.types import DecimalType newouterjoindffinal = newouterjoindf.withColumn("netto_resultaat",col("netto_resultaat").cast(DecimalType())) newouterjoindffinal.printSchema()

The result is that the numbers in column netto_resultaat are converted as null. there are records that have data like -100,880 , 35,312,000 or 118,579,525.

Any suggestions?

enter image description here

1

1 Answers

2
votes

It is because you have comma in your numbers and you have to have dots instead.

from pyspark.sql import functions as F

df = pd.DataFrame({"netto_resultaat": ["100,800", "10,20", "20,342"]})
df = spark.createDataFrame(df)

df.show()

# output
+---------------+
|netto_resultaat|
+---------------+
|        100,800|
|          10,20|
|         20,342|
+---------------+

You have to first replace commas with dots and then apply cast:

(
    df
    .withColumn("netto_resultaat", 
                F.regexp_replace("netto_resultaat", ",", ".")
                .cast(DecimalType())
               )
    .show()
)

# output
+---------------+
|netto_resultaat|
+---------------+
|            101|
|             10|
|             20|
+---------------+

Or replace with empty string if those are thousands (I don't know your case exactly):

(
    df
    .withColumn("netto_resultaat", 
                F.regexp_replace("netto_resultaat", ",", "")
                .cast(DecimalType())
               )
    .show()
)

# output
+---------------+
|netto_resultaat|
+---------------+
|         100800|
|           1020|
|          20342|