Extracting raw data from a PowerPivot model using Python

Question

What seemed like a trivial task turned into a real nightmare when I had to read in some data from a PowerPivot model using Python. I believe I've researched this very well over the last couple of days but now I hit a brick wall and would appreciate some help from the Python/SSAS/ADO community.

Basically, all I want to do is programmatically access raw data stored in PowerPivot models - my idea was to connect to the underlying PowerPivot (i.e. MS Analysis Services) engine via one of the methods listed below, list the tables contained in the model, then extract the raw data from each table using a simple DAX query (something like EVALUATE (table_name)). Easy peasy, right? Well, maybe not.

0. Some Background Information

As you can see, I've tried several different approaches. I'll try to document everything as carefully as possible so that those uninitiated in PowerPivot functionality will have a good idea of what I'd like to do.

First of all, some background on programmatic access to Analysis Services engine (it says 2005 SQL Server, but all of it ought to still be applicable): SQL Server Data Mining Programmability and Data providers used for Analysis Services connections.

The sample Excel/PowerPivot file I'll be using in the example below can be found here: Microsoft PowerPivot for Excel 2010 and PowerPivot in Excel 2013 Samples.

Also, note that I'm using Excel 2010, so some of my code is version-specific. E.g. wb.Connections["PowerPivot Data"].OLEDBConnection.ADOConnection should be wb.Model.DataModelConnection.ModelConnection.ADOConnection if you're using Excel 2013.

The connection string I'll be using throughout this question is based on the information found here: Connect to PowerPivot engine with C#. Additionally, some of the methods apparently require some sort of initialization of the PowerPivot model prior to data retrieval. See here: Automating PowerPivot Refresh operation from VBA.

Finally, here's a couple of links showing that this should be achievable (note however, that these links mainly refer to C#, not Python):

Made connection to PowerPivot DataModel, how can I fill a dataset with it?
Connecting to PowerPivot with C#
2013 C# connection to PowerPivot DataModel
Connecting Tableau and PowerPivot. It just works. (showing that external apps can in fact read PowerPivot model data - note that the Tableau add-in installs Interop.ADODB.dll assembly, which I guess is what it uses to access the PowerPivot data)

1. Using ADOMD

import clr
clr.AddReference("Microsoft.AnalysisServices.AdomdClient")
import Microsoft.AnalysisServices.AdomdClient as ADOMD
ConnString = "Provider=MSOLAP;Data Source=$Embedded$;Locale Identifier=1033;
             Location=H:\\PowerPivotTutorialSample.xlsx;SQLQueryMode=DataKeys"

Connection = ADOMD.AdomdConnection(ConnString)
Connection.Open()

Here, it appears the problem is that the PowerPivot model has not been initialized:

AdomdConnectionException: A connection cannot be made. Ensure that the server is running.

2. Using AMO

import clr
clr.AddReference("Microsoft.AnalysisServices")
import Microsoft.AnalysisServices as AMO
ConnString = "Provider=MSOLAP;Data Source=$Embedded$;Locale Identifier=1033;
             Location=H:\\PowerPivotTutorialSample.xlsx;SQLQueryMode=DataKeys"

Connection = AMO.Server()
Connection.Connect(ConnString)

Same story, "the server is not running":

ConnectionException: A connection cannot be made. Ensure that the server is running.

Note that AMO is technically not used for querying data, but I included it as one of the potential ways of connecting to the PowerPivot model.

3. Using ADO.NET

import clr
clr.AddReference("System.Data")
import System.Data.OleDb as ADONET
ConnString = "Provider=MSOLAP;Data Source=$Embedded$;Locale Identifier=1033;
             Location=H:\\PowerPivotTutorialSample.xlsx;SQLQueryMode=DataKeys"

Connection = ADONET.OleDbConnection()
Connection.ConnectionString = ConnString
Connection.Open()

This is similar to What's the simplest way to access mssql with python or ironpython?. Unfortunately, this also doesn't work:

OleDbException: OLE DB error: OLE DB or ODBC error: The following system error occurred:
The requested name is valid, but no data of the requested type was found.

4. Using ADO via adodbapi module

import adodbapi
ConnString = "Provider=MSOLAP;Data Source=$Embedded$;Locale Identifier=1033;
             Location=H:\\PowerPivotTutorialSample.xlsx;SQLQueryMode=DataKeys"

Connection = adodbapi.connect(ConnString)

Similar to Opposite Workings of OLEDB/ODBC between Python and MS Access VBA. The error I get is:

OperationalError: (com_error(-2147352567, 'Exception occurred.', (0, u'Microsoft OLE DB
Provider for SQL Server 2012 Analysis Services.', u'OLE DB error: OLE DB or ODBC error: The
following system error occurred:  The requested name is valid, but no data of the requested
type was found...

This is basically the same problem as with ADO.NET above.

5. Using ADO via Excel/win32com module

from win32com.client import Dispatch
Xlfile = "H:\\PowerPivotTutorialSample.xlsx"
XlApp = Dispatch("Excel.Application")
Workbook = XlApp.Workbooks.Open(Xlfile)
Workbook.Connections["PowerPivot Data"].Refresh()
Connection = Workbook.Connections["PowerPivot Data"].OLEDBConnection.ADOConnection
Recordset = Dispatch('ADODB.Recordset')

Query = "EVALUATE(dbo_DimDate)" #sample DAX query
Recordset.Open(Query, Connection)

The idea for this approach came from this blog post that uses VBA: Export a table or DAX query from Power Pivot to CSV using VBA. Note that this approach uses an explicit Refresh command that initializes the model (i.e. "server"). Here's the error message:

com_error: (-2147352567, 'Exception occurred.', (0, u'ADODB.Recordset', u'Arguments are of
the wrong type, are out of acceptable range, or are in conflict with one another.',
u'C:\\Windows\\HELP\\ADO270.CHM', 1240641, -2146825287), None)

It appears, however, that the ADO connection has been established:

type(Connection) returns instance
print(Connection) returns Provider=MSOLAP.5;Persist Security Info=True;Initial Catalog=Microsoft_SQLServer_AnalysisServices;Data Source=$Embedded$;MDX Compatibility=1;Safety Options=2;ConnectTo=11.0;MDX Missing Member Mode=Error;Subqueries=2;Optimize Response=3;Cell Error Mode=TextValue

It seems the problem lies in the creation of the ADODB.Recordset object.

6. Using ADO via Excel/win32com, direct use of ADODB.Connection

from win32com.client import Dispatch
ConnString = "Provider=MSOLAP;Data Source=$Embedded$;Locale Identifier=1033;
             Location=H:\\PowerPivotTutorialSample.xlsx;SQLQueryMode=DataKeys"

Connection = Dispatch('ADODB.Connection')
Connection.Open(ConnString)

Similar to Connection to Access from Python [duplicate] and Query access using ADO in Win32 platform (Python recipe). Unfortunately, the error Python spits out is the same as in the two examples above:

com_error: (-2147352567, 'Exception occurred.', (0, u'Microsoft OLE DB Provider for SQL
Server 2012 Analysis Services.', u'OLE DB error: OLE DB or ODBC error: The following system
error occurred:  The requested name is valid, but no data of the requested type was found.
..', None, 0, -2147467259), None)

7. Using ADO via Excel/win32com, direct use of ADODB.Connection plus model refresh

from win32com.client import Dispatch
Xlfile = "H:\\PowerPivotTutorialSample.xlsx"
XlApp = Dispatch("Excel.Application")
Workbook = XlApp.Workbooks.Open(Xlfile)
Workbook.Connections["PowerPivot Data"].Refresh()
ConnStringInternal = "Provider=MSOLAP.5;Persist Security Info=True;Initial Catalog=
                     Microsoft_SQLServer_AnalysisServices;Data Source=$Embedded$;MDX
                     Compatibility=1;Safety Options=2;ConnectTo=11.0;MDX Missing Member
                     Mode=Error;Optimize Response=3;Cell Error Mode=TextValue"

Connection = Dispatch('ADODB.Connection')
Connection.Open(ConnStringInternal)

I was hoping I could initialize an instance of Excel, then initialize the PowerPivot model, and then create a connection using the internal connection string Excel uses for embedded PowerPivot data (similar to How do you copy the powerpivot data into the excel workbook as a table? - note that the connection string is different from the one I've used elsewhere). Unfortunately, this doesn't work and my guess is that Python starts the ADODB.Connection process in a separate instance (as I get the same error message when I execute the last three rows without first initializing Excel, etc.):

com_error: (-2147352567, 'Exception occurred.', (0, u'Microsoft OLE DB Provider for SQL
Server 2012 Analysis Services.', u'Either the user, ****** (masked), does not have access
to the Microsoft_SQLServer_AnalysisServices database, or the database does not exist.',
None, 0, -2147467259), None)

May I ask why you need to access the data in the PP model directly? Power Pivot data must be loaded from some other source, whether it be a data source that exposes a more robust ODBC interface (or other programmatic connectivity) or flat files that you could consume natively in Pyhon. — greggyb
I work in an industry where we offen get PowerPivot data from third parties. They're not our clients, so we can't request the original data in a more suitable format, etc. — akavalar
Understood. DAX Studio is capable of running arbitrary DAX queries against an open workbook. You might be able to dig into its internals or contact the maintainer, Darren Gosbell, for some more informed assistance on this: daxstudio.codeplex.com — greggyb
Thanks for the suggestion, @greggyb! I've already looked at DAX Studio, but since there's no notion of rows and row indices in PowerPivot/Vertipaq db engine, I don't think there's a way of limiting the DAX query to just a subset of records (I've done a fair share of research on this and I think this actually deserves an entirely new question). Anyway, this way one could get around Excel's 1m row limit by e.g. using a VBA script that would output 1m chunks of data into standalone workbooks (similar to what's suggested here: stackoverflow.com/a/33580647). — akavalar
I should note that splitting the table into smaller chunks is only needed when there's more than 1m records (for example, the dbo_FactSales table in the sample file mentioned above), but not when the number of rows is smaller than that (e.g., the dbo_DimProduct table). In that case, the solution suggested in that SO link is super straightforward. I will reach out to Darren though to see if there's anything I might've missed. Thanks again! — akavalar

akavalar akavalar · Accepted Answer · 2016-04-27T00:38:34

Lo and behold, I finally managed to crack the problem - turns out that accessing Power Pivot data using Python is indeed possible! Below's a short recap of what I did - you can find a more detailed description here: Analysis Services (SSAS) on a shoestring. Note: the code has been optimized neither for efficiency nor elegance.

Install Microsoft Power BI Desktop (comes with free Analysis Services server, so no need for a costly SQL Server license - however, the same approach obviously also works if you have a proper license).
Fire up the AS engine by first creating the msmdsrv.ini settings file, then restore the database from the ABF file (using AMO.NET), then extract data using ADOMD.NET.

Here's the Python code that illustrates the AS engine + AMO.NET parts:

import psutil, subprocess, random, os, zipfile, shutil, clr, sys, pandas

def initialSetup(pathPowerBI):
    sys.path.append(pathPowerBI)

    #required Analysis Services assemblies
    clr.AddReference("Microsoft.PowerBI.Amo.Core")
    clr.AddReference("Microsoft.PowerBI.Amo")     
    clr.AddReference("Microsoft.PowerBI.AdomdClient")

    global AMO, ADOMD
    import Microsoft.AnalysisServices as AMO
    import Microsoft.AnalysisServices.AdomdClient as ADOMD

def restorePowerPivot(excelName, pathTarget, port, pathPowerBI):   
    #create random folder
    os.chdir(pathTarget)
    folder = os.getcwd()+str(random.randrange(10**6, 10**7))
    os.mkdir(folder)

    #extract PowerPivot model (abf backup)
    archive = zipfile.ZipFile(excelName)
    for member in archive.namelist():
        if ".data" in member:
            filename = os.path.basename(member)
            abfname = os.path.join(folder, filename) + ".abf"
            source = archive.open(member)
            target = file(os.path.join(folder, abfname), 'wb')
            shutil.copyfileobj(source, target)
            del target
    archive.close()

    #start the cmd.exe process to get its PID
    listPIDpre = [proc for proc in psutil.process_iter()]
    process = subprocess.Popen('cmd.exe /k', stdin=subprocess.PIPE)
    listPIDpost = [proc for proc in psutil.process_iter()]
    pid = [proc for proc in listPIDpost if proc not in listPIDpre if "cmd.exe" in str(proc)][0]
    pid = str(pid).split("=")[1].split(",")[0]

    #msmdsrv.ini
    msmdsrvText = '''<ConfigurationSettings>
       <DataDir>{0}</DataDir>
       <TempDir>{0}</TempDir>
       <LogDir>{0}</LogDir>
       <BackupDir>{0}</BackupDir>
       <DeploymentMode>2</DeploymentMode>
       <RecoveryModel>1</RecoveryModel>
       <DisklessModeRequested>0</DisklessModeRequested>
       <CleanDataFolderOnStartup>1</CleanDataFolderOnStartup>
       <AutoSetDefaultInitialCatalog>1</AutoSetDefaultInitialCatalog>
       <Network>
          <Requests>
             <EnableBinaryXML>1</EnableBinaryXML>
             <EnableCompression>1</EnableCompression>
          </Requests>
          <Responses>
             <EnableBinaryXML>1</EnableBinaryXML>
             <EnableCompression>1</EnableCompression>
             <CompressionLevel>9</CompressionLevel>
          </Responses>
          <ListenOnlyOnLocalConnections>1</ListenOnlyOnLocalConnections>
       </Network>
       <Port>{1}</Port>
       <PrivateProcess>{2}</PrivateProcess>
       <InstanceVisible>0</InstanceVisible>
       <Language>1033</Language>
       <Debug>
          <CallStackInError>0</CallStackInError>
       </Debug>
       <Log>
          <Exception>
             <CrashReportsFolder>{0}</CrashReportsFolder>
          </Exception>
          <FlightRecorder>
             <Enabled>0</Enabled>
          </FlightRecorder>
       </Log>
       <AllowedBrowsingFolders>{0}</AllowedBrowsingFolders>
       <ResourceGovernance>
          <GovernIMBIScheduler>0</GovernIMBIScheduler>
       </ResourceGovernance>
       <Feature>
          <ManagedCodeEnabled>1</ManagedCodeEnabled>
       </Feature>
       <VertiPaq>
          <EnableDisklessTMImageSave>0</EnableDisklessTMImageSave>
          <EnableProcessingSimplifiedLocks>1</EnableProcessingSimplifiedLocks>
       </VertiPaq>
    </ConfigurationSettings>'''

    #save ini file to disk, fill it with required parameters
    msmdsrvini = open(folder+"\\msmdsrv.ini", "w")
    msmdsrvText = msmdsrvText.format(folder, port, pid) #{0},{1},{2}
    msmdsrvini.write(msmdsrvText)
    msmdsrvini.close()

    #run AS engine inside the cmd.exe process
    initString = "\"{0}\\msmdsrv.exe\" -c -s \"{1}\""
    initString = initString.format(pathPowerBI.replace("/","\\"),folder)
    process.stdin.write(initString + " \n")

    #connect to the AS instance from Python
    AMOServer = AMO.Server()
    AMOServer.Connect("localhost:{0}".format(port))

    #restore database from PowerPivot abf backup, disconnect
    AMORestoreInfo = AMO.RestoreInfo(os.path.join(folder, abfname))
    AMOServer.Restore(AMORestoreInfo)
    AMOServer.Disconnect()

    return process

And the data-extraction part:

def runQuery(query, port, flag):
    #ADOMD assembly
    ADOMDConn = ADOMD.AdomdConnection("Data Source=localhost:{0}".format(port))
    ADOMDConn.Open()
    ADOMDCommand = ADOMDConn.CreateCommand() 
    ADOMDCommand.CommandText = query

    #read data in via AdomdDataReader object
    DataReader = ADOMDCommand.ExecuteReader()

    #get metadata, number of columns
    SchemaTable = DataReader.GetSchemaTable()
    numCol = SchemaTable.Rows.Count #same as DataReader.FieldCount

    #get column names
    columnNames = []
    for i in range(numCol):
        columnNames.append(str(SchemaTable.Rows[i][0]))

    #fill with data
    data = []
    while DataReader.Read()==True:
        row = []
        for j in range(numCol):
            try:
                row.append(DataReader[j].ToString())
            except:
                row.append(DataReader[j])
        data.append(row)
    df = pandas.DataFrame(data)
    df.columns = columnNames 

    if flag==0:
        DataReader.Close()
        ADOMDConn.Close()

        return df     
    else:   
        #metadata table
        metadataColumnNames = []
        for j in range(SchemaTable.Columns.Count):
            metadataColumnNames.append(SchemaTable.Columns[j].ToString())
        metadata = []
        for i in range(numCol):
            row = []
            for j in range(SchemaTable.Columns.Count):
                try:
                    row.append(SchemaTable.Rows[i][j].ToString())
                except:
                    row.append(SchemaTable.Rows[i][j])
            metadata.append(row)
        metadf = pandas.DataFrame(metadata)
        metadf.columns = metadataColumnNames

        DataReader.Close()
        ADOMDConn.Close()

        return df, metadf

The raw data are then extracted via something like this:

pathPowerBI = "C:/Program Files/Microsoft Power BI Desktop/bin"
initialSetup(pathPowerBI)
session = restorePowerPivot("D:/Downloads/PowerPivotTutorialSample.xlsx", "D:/", 60000, pathPowerBI)
df, metadf = runQuery("EVALUATE dbo_DimProduct", 60000, 1)
endSession(session)