0
votes

I am developing an internal system that is intended to work very much like Google Docs. The main piece I am implementing mimics their web-based Spreadsheet implementation. For multiple reasons I am not able to use Google Docs or ZK, which has a very robust Spreadsheet API. I chose POI 3.7 as a starting point for my Excel spreadsheet processing.

Currently when a user uploads an Excel spreadsheet, I take the file byte[] and store it in our db as a blob. When a user wants to view the spreadsheet, I pull out the byte[], build the Workbook, and push it to the client UI for editing. The pushing to the UI isn't my concern. When a user makes edits to the spreadsheet, I push the edits to the server and store them on a stack and only apply the updates when the user presses the "save" button. On save, I pull the workbook back out of the database, make the changes and push the Workbook back to the db. That way, I don't keep it in memory. It's no surprise that all of this is pretty fast except for when multiple users start doing this, obviously exploding Workbooks eats memory as described in other posts here.

A user will only update one tab at a time, why should I need to open the entire workbook? When a user initially uploads an excel spreadsheet, can I pull out each Sheet, convert each to a byte[] and save each as an indiviaual "worksheet" db record? The POI Sheet has a protected, "#write(Stream)" method but I would not like to get into the business of re-compiling POI. I also would not like to explode every cell as a new db entry. Would you guys do this differently in the first place?

Backend is java/spring/jdbc. For internal reasons, these are the technologies I'm stuck using.

3
Welcome to SO, Eric! ".. I'm stuck using." <-- somehow that sounds very negative :-)Wivani

3 Answers

2
votes

Storing big binary blobs in the database is in itself not a good thing if performance is important. You would be much better off storing the workbooks on the disk.

I can only give you half an answer to your question and that is that you can read xslx (not xsl) files one sheet at a time using (http://poi.apache.org/apidocs/index.html?org/apache/poi/xssf/eventusermodel/XSSFReader.html) and that you can use a SAXParser to avoid holding each full sheet in memory. I don't think there is any way of saving it without creating a sheet object.


Warning Hack: One quick hack could be to use reflection to call the protected method. There is of course no guarantee that this will work in future versions of POI.

1
votes

With Excel files, some things are stored at the sheet level, but other bits are stored at the workbook level. As your user edits a sheet, while most of their changes will be on the sheet part, some bits will need to touch the workbook level entities, and for that you'll need the whole file.

You might want to take a look at how SharePoint does its collaborative editing, which allows several people using Excel to work on the same file much like google docs. All the SharePoint protocol documents are publicly available, and there was an event on the docs very recently for which videos and presentations should be online soon, keep an eye on the office interop blog for when they do. In the SharePoint docs you should find the details on how Microsoft chunks up an Excel file for collaborative editing, and there's something to be said for you doing the same!

0
votes

I would consider looking into saving the sheets as separate XML's in the database. If you store additional (meta)data about sheets belonging together in the database it shouldn't be too much hassle keeping them together. The reason behind using XML is that from Excel 2003 up spreadsheets can be saved as xml and therefor can easily be created by code as well.

If at one point you seem to be hitting too many walls with Apache POI, you could look into the OpenOffice API as well.