6
votes

As per the title, I would like to submit a calculation to a Spark cluster (local or HDInsight in Azure) and get the results back from a C# application.

I am aware of Livy, which I understand is a REST API application sitting on top of Spark for querying it, but I have not found a standard C# API package for it. Is this the right tool for the job? Is it just missing a well-known C# API?

The Spark cluster needs to access Azure Cosmos DB, therefore I need to be able to submit a job including the connector jar library (or its path on the cluster driver) in order for Spark to read data from Cosmos.

4
Have you checked Mobius? - T. Gawęda
@T.Gawęda I have indeed, but it looks to me more like a way of writing Spark jobs in C# than an API for invoking them and getting results. Does it have this use, too? - Stefano d'Antonio
I don't know to be honest. I only recognized Spark + C# = Mobius from Databricks post ;) - T. Gawęda
Thanks anyway, I might dig a bit more, but looks like it's just bindings to create jobs. @T.Gawęda - Stefano d'Antonio

4 Answers

4
votes

As a .NET Spark connector to query data did not seem to exist, I wrote one:

https://github.com/UnoSD/SparkSharp

It is just a quick implementation, but it also has a way of querying Cosmos DB using Spark SQL.

It's just a C# client for Livy, but it should be more than enough.

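// Assumes this runs inside an async method, and that `config` (the session
// configuration) and a `Result` type matching the query columns are defined elsewhere.
using (var client = new HdInsightClient("clusterName", "admin", "password"))
using (var session = await client.CreateSessionAsync(config))
{
    // Run a Scala statement on the cluster and get the result back as an int
    var sum = await session.ExecuteStatementAsync<int>("val res = 1 + 1\nprintln(res)");

    const string sql = "SELECT id, SUM(json.total) AS total FROM cosmos GROUP BY id";

    // Run a Spark SQL query against a Cosmos DB collection
    var cosmos = await session.ExecuteCosmosDbSparkSqlQueryAsync<IEnumerable<Result>>
    (
        "cosmosName",
        "cosmosKey",
        "cosmosDatabase",
        "cosmosCollection",
        "cosmosPreferredRegions",
        sql
    );
}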
2
votes

If you're just looking for a way to query your Spark cluster using Spark SQL, then this is a way to do it from C#:

https://github.com/Azure-Samples/hdinsight-dotnet-odbc-spark-sql/blob/master/Program.cs

The console app requires an ODBC driver to be installed. You can find it here:

https://www.microsoft.com/en-us/download/details.aspx?id=49883

The console app also has a bug, which you can fix by adding one line right after the connection string is generated. Immediately after this line:

connectionString = GetDefaultConnectionString();

Add this line

connectionString = connectionString + "DSN=Sample Microsoft Spark DSN";

If you changed the name of the DSN when you installed the Spark ODBC driver, you will need to change the name in the line above accordingly.

Since you need to access data from Cosmos DB, you could open a Jupyter Notebook on your cluster, ingest the data into Spark (creating a permanent table of your data there), and then use this console app or your own C# app to query that data.
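Once the DSN exists, querying from your own C# app could look roughly like the sketch below. It uses the standard System.Data.Odbc classes (a NuGet package on .NET Core and later). The class and method names are made up for illustration, and the UID/PWD keys are an assumption about what the Spark ODBC driver accepts, so check the driver documentation before relying on them.

```csharp
using System;
using System.Collections.Generic;
using System.Data.Odbc;

public static class SparkOdbcSketch
{
    // Builds a DSN-based connection string.
    // Assumption: the driver reads credentials from the UID and PWD keys.
    public static string BuildConnectionString(string dsn, string user, string password) =>
        $"DSN={dsn};UID={user};PWD={password}";

    // Runs a Spark SQL query and returns the first column of every row as text.
    public static List<string> QueryFirstColumn(string connectionString, string sql)
    {
        var rows = new List<string>();

        using var connection = new OdbcConnection(connectionString);
        connection.Open();

        using var command = new OdbcCommand(sql, connection);
        using var reader = command.ExecuteReader();
        while (reader.Read())
        {
            rows.Add(Convert.ToString(reader.GetValue(0)));
        }

        return rows;
    }
}
```

You would call it with something like SparkOdbcSketch.QueryFirstColumn(SparkOdbcSketch.BuildConnectionString("Sample Microsoft Spark DSN", "admin", "password"), "SHOW TABLES"), replacing the DSN name and credentials with your own.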

If you have a Spark job written in Scala/Python and need to submit it from a C# app, then I guess Livy is the best way to go. I am unsure whether Mobius supports that.
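For the Livy route, here is a minimal sketch of submitting a precompiled job as a batch with plain HttpClient. The POST /batches endpoint and its file, className and jars fields come from the Livy REST API; the class and method names are made up for illustration. The basic authentication and the X-Requested-By header are assumptions based on how HDInsight is usually set up, so adjust them for your cluster. The jars field is where the Cosmos DB connector jar path would go.

```csharp
using System;
using System.Net.Http;
using System.Net.Http.Headers;
using System.Text;
using System.Text.Json;
using System.Threading.Tasks;

public static class LivyBatchSketch
{
    // Builds the JSON body for Livy's POST /batches endpoint.
    // `file` is the application jar, `jars` are extra jars for the session
    // (e.g. a connector), all as paths the cluster can resolve.
    public static string BuildBatchPayload(string file, string className, string[] jars) =>
        JsonSerializer.Serialize(new { file, className, jars });

    // Posts the payload and returns Livy's raw JSON response describing the batch.
    // `livyBaseUri` must end with a slash, e.g. https://<cluster>.azurehdinsight.net/livy/
    public static async Task<string> SubmitBatchAsync(
        Uri livyBaseUri, string user, string password, string payload)
    {
        using var http = new HttpClient();

        // Assumption: the cluster gateway uses HTTP basic authentication.
        var token = Convert.ToBase64String(Encoding.UTF8.GetBytes($"{user}:{password}"));
        http.DefaultRequestHeaders.Authorization = new AuthenticationHeaderValue("Basic", token);

        // Assumption: the server expects this header on POST requests (CSRF protection).
        http.DefaultRequestHeaders.Add("X-Requested-By", user);

        using var body = new StringContent(payload, Encoding.UTF8, "application/json");
        using var response = await http.PostAsync(new Uri(livyBaseUri, "batches"), body);
        response.EnsureSuccessStatusCode();

        return await response.Content.ReadAsStringAsync();
    }
}
```

The response contains the batch id, which you can then poll with GET /batches/{id} to see when the job has finished. A batch does not return a result set by itself, so the job would need to write its output somewhere your app can read it. If you want values back directly, Livy's interactive sessions (POST /sessions, then POST /sessions/{id}/statements) are the alternative.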

0
votes

Microsoft just released DataFrame-based .NET support for Apache Spark as open source through the .NET Foundation. See http://dot.net/spark and http://github.com/dotnet/spark for more details. It is now available in HDInsight by default if you select the correct HDP/Spark version (currently 3.6 and 2.3, soon others as well).

-1
votes

UPDATE:

Long ago I answered this question with a clear no. However, times have changed and Microsoft has made an effort. Please check out https://dotnet.microsoft.com/apps/data/spark

https://github.com/dotnet/spark

    // Create a Spark session
    var spark = SparkSession
        .Builder()
        .AppName("word_count_sample")
        .GetOrCreate();

Writing Spark applications in C# is now that easy!

OUTDATED:

No, C# is not the tool you should choose if you would like to work with Spark! However, if you really want to do the job with it, try Mobius, as mentioned above: https://github.com/Microsoft/Mobius

Spark has 4 main languages, each with its own API: Scala, Java, Python and R. If you are looking for a language to use in production, I would not suggest the R API. The other 3 work well.

For the Cosmos DB connection I would suggest: https://github.com/Azure/azure-cosmosdb-spark