
I am new to the world of Java and Spark. I came across an impressive library that provides C# bindings for Spark, which allows us to use C# to work with Spark SQL.

I have a large amount of process data in a custom data store that has ODBC and OPC interfaces. We would like to expose this data to Apache Spark so that we can run analytical queries on it using tools like Apache Zeppelin.

As there is no JDBC interface on my custom store, I was looking at writing C# code to pull the data from the custom data store using the available ODBC interface and make it available to Spark via historyDataFrame.RegisterTempTable("mydata");
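Roughly, I imagine something along these lines (a sketch only: the DSN, query, column names, and historySchema below are placeholders for my actual store, and I am assuming CreateDataFrame and RegisterTempTable work as in the SparkCLR samples):

    using System.Collections.Generic;
    using System.Data.Odbc;

    // pull rows from the custom store over ODBC into memory
    var rows = new List<object[]>();
    using (var connection = new OdbcConnection("DSN=MyCustomStore"))  // placeholder DSN
    {
        connection.Open();
        using (var command = new OdbcCommand("SELECT Tag, Value, Timestamp FROM History", connection))  // placeholder query
        using (var reader = command.ExecuteReader())
        {
            while (reader.Read())
            {
                var row = new object[reader.FieldCount];
                reader.GetValues(row);  // copy all columns of the current record
                rows.Add(row);
            }
        }
    }

    // hand the rows to Spark and register them as a queryable temp table
    var rddHistory = SparkCLRSamples.SparkContext.Parallelize(rows);
    var historyDataFrame = GetSqlContext().CreateDataFrame(rddHistory, historySchema);  // historySchema: a StructType matching the query
    historyDataFrame.RegisterTempTable("mydata");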

I am able to create a sample and query it using SQL from the C# side, but what I am unable to understand is how this can be made available to Spark so that I can work with tools like Apache Zeppelin.

Also, what is the best way to load a large amount of data into Spark SQL? Doing something like the sample below may not work for loading over a million records.

    // build an RDD from a small in-memory list of rows
    var rddPeople = SparkCLRSamples.SparkContext.Parallelize(
        new List<object[]>
        {
            new object[] { "123", "Bill", 43, new object[] { "Columbus", "Ohio" }, new string[] { "Tel1", "Tel2" } },
            new object[] { "456", "Steve", 34, new object[] { "Seattle", "Washington" }, new string[] { "Tel3", "Tel4" } }
        });

    // convert the RDD to a DataFrame using the schema defined in the sample (schemaPeople)
    var dataFramePeople = GetSqlContext().CreateDataFrame(rddPeople, schemaPeople);

Hoping to get some pointers here to get this working.


1 Answer


You could dump the data in csv format and let Spark/SparkCLR load that data for Spark SQL analysis. Loading the data from csv files will have the same result as parallelize in your code except that it will have much better performance. This approach will work for you if the data in your custom SQL source is append-only with no updates to existing data. If your custom source allows updates, the csv dump will go stale and you need a way to keep it fresh before doing analytics. An alternative is to explore if a JDBC-ODBC bridge can be employed to directly connect Spark SQL to your custom source obviating the need for dumping data in csv format.