4
votes

I'm working with xslx Excel file on the server side in C#. In a spreadsheet, say there are 15 columns (cells) total. In the rows of cells, some values are missing. So the first row is my header will properly have the 15 cells. But my data rows, some cells might have empty values, so Open XML has a "jagged" set of cells values. Row 1 will have the full 15 cells, Row 2 might have 13 cells since two of the values are empty. What! How do I map this data properly? It basically shifts everything to the left and my cell values are wrong. What am I missing? It seems like they are being "collapsed" in Open XML terminology.

WorkbookPart workbookPart = spreadSheetDocument.WorkbookPart;
                IEnumerable<Sheet> sheets = spreadSheetDocument.WorkbookPart.Workbook.GetFirstChild<Sheets>().Elements<Sheet>();
                string relationshipId = sheets.First().Id.Value;
                WorksheetPart worksheetPart = (WorksheetPart)spreadSheetDocument.WorkbookPart.GetPartById(relationshipId);
                Worksheet workSheet = worksheetPart.Worksheet;
                SheetData sheetData = workSheet.GetFirstChild<SheetData>();
                IEnumerable<Row> rows = sheetData.Descendants<Row>();

CLARIFICATION: Here's another way to ask this question. What if I want to take the contents on an Excel file and put it into a DataTable. I want all the columns of data to line up correctly. How could I accomplish this?

This question is asked better than I here: reading Excel Open XML is ignoring blank cells

2
MS article somewhat on the topic: msdn.microsoft.com/en-us/library/…Shane
Can you elaborate? Are you modifying an existing xlsx file and the output is wrong? Are you trying to read data from an existing xlsx file and the OpenXML cells aren't what you expect? Having trouble understanding the problem, and what your code is trying to accomplish.Sapph
Trying to read in an XLSX file and do work with it. For example, this: foreach (Row row in rows) But in row one, it has the proper 15 cells, row 2, has 13 cells only since in my Excel file, two are just empty strings. So it shifts all the data in Row2 over. Now all the cells are not mapped properly, so for Row 2, Cell 6 for example, it really is the value of Cell 7, which isn't right. Hard to explain i suppose.Shane

2 Answers

5
votes

As I understand it you are iterating over the Cells in a row and assuming that the first cell you read is in column A, the second in column B and so on?

I suggest that you (parse?) regex the column position / reference from the

DocumentFormat.OpenXml.Spreadsheet.Cell currentcell
currentcell.CellReference

CellReference gives you the cell reference in "A1" notation. Extract the Column part ("A,B,CC,etc.)

You would have to do this for each cell in a row. Then if a cell for a column is missing just fill in a placeholder value. Null or DbNull maybe?

I dont know if there is another way to figure out to what column a cell belongs.

4
votes

One way you can accomplish what you want is to figure out the largest column index in all your rows and then fill in all the empty cell values will blanks. This will keep all your columns lined up properly.

Here is a quick snippet to figure out the largest column index:

int? biggestColumnIndex = 0;
foreach (Row row in rows) 
{
   if (row.Descendants<Cell>().Any())
   {
      // Figure out the if this row has a bigger column index than the previous rows
      int? columnIndex = GetColumnIndexFromName(((Cell)(row.LastChild)).CellReference);
      biggestColumnIndex = columnIndex.HasValue && columnIndex > biggestColumnIndex ?  columnIndex : biggestColumnIndex;                   
   }
}

        /// <summary>
        /// Given just the column name (no row index), it will return the zero based column index.
        /// Note: This method will only handle columns with a length of up to two (ie. A to Z and AA to ZZ). 
        /// A length of three can be implemented when needed.
        /// </summary>
        /// <param name="columnName">Column Name (ie. A or AB)</param>
        /// <returns>Zero based index if the conversion was successful; otherwise null</returns>
        public static int? GetColumnIndexFromName(string columnName)
        {
            int? columnIndex = null;

            string[] colLetters = Regex.Split(columnName, "([A-Z]+)");
            colLetters = colLetters.Where(s => !string.IsNullOrEmpty(s)).ToArray();

            if (colLetters.Count() <= 2)
            {
                int index = 0;
                foreach (string col in colLetters)
                {
                    List<char> col1 = colLetters.ElementAt(index).ToCharArray().ToList();
                    int? indexValue = Letters.IndexOf(col1.ElementAt(index));

                    if (indexValue != -1)
                    {
                        // The first letter of a two digit column needs some extra calculations
                        if (index == 0 && colLetters.Count() == 2)
                        {
                            columnIndex = columnIndex == null ? (indexValue + 1) * 26 : columnIndex + ((indexValue + 1) * 26);
                        }
                        else
                        {
                            columnIndex = columnIndex == null ? indexValue : columnIndex + indexValue;
                        }
                    }

                    index++;
                }
            }

            return columnIndex;
        }

Then call the InsetCellsForCellRange method after you have the biggest column index to fill in all the empty cells with blank cells. Then read in your data and they should all line up. (All helper methods are below the InsetCellsForCellRange method)

/// <summary>
/// Inserts cells if required for a rectangular range of cells
/// </summary>
/// <param name="startCellReference">Upper left cell of the rectangle</param>
/// <param name="endCellReference">Lower right cell of the rectangle</param>
/// <param name="worksheetPart">Worksheet part to insert cells</param>
public static void InsertCellsForCellRange(string startCellReference, string endCellReference, WorksheetPart worksheetPart)
{
    uint startRow = GetRowIndex(startCellReference);
    uint endRow = GetRowIndex(endCellReference);
    string startColumn = GetColumnName(startCellReference);
    string endColumn = GetColumnName(endCellReference);

    // Insert the cells row by row if necessary
    for (uint currentRow = startRow; currentRow <= endRow; currentRow++)
    {
        string currentCell = startColumn + currentRow.ToString();
        string endCell = IncrementCellReference(endColumn + currentRow.ToString(), CellReferencePartEnum.Column);

        // Check to make sure all cells exist in the range; if not create them
        while (!currentCell.Equals(endCell))
        {
            if (GetCell(worksheetPart, currentCell) == null)
            {
                InsertCell(GetColumnName(currentCell), GetRowIndex(currentCell), worksheetPart);
            }

            // Move the reference to the next cell in the range
            currentCell = IncrementCellReference(currentCell, CellReferencePartEnum.Column);
        }
    }
}

        /// <summary>
        /// Given a cell name, parses the specified cell to get the row index.
        /// </summary>
        /// <param name="cellReference">Address of the cell (ie. B2)</param>
        /// <returns>Row Index (ie. 2)</returns>
        public static uint GetRowIndex(string cellReference)
        {
            // Create a regular expression to match the row index portion the cell name.
            Regex regex = new Regex(@"\d+");
            Match match = regex.Match(cellReference);

            return uint.Parse(match.Value);
        }



    /// <summary>
    /// Given a cell name, parses the specified cell to get the column name.
    /// </summary>
    /// <param name="cellReference">Address of the cell (ie. B2)</param>
    /// <returns>Column Name (ie. B)</returns>
    public static string GetColumnName(string cellReference)
    {
        // Create a regular expression to match the column name portion of the cell name.
        Regex regex = new Regex("[A-Za-z]+");
        Match match = regex.Match(cellReference);

        return match.Value;
    }

        /// <summary>
        /// Increments the reference of a given cell.  This reference comes from the CellReference property
        /// on a Cell.
        /// </summary>
        /// <param name="reference">reference string</param>
        /// <param name="cellRefPart">indicates what is to be incremented</param>
        /// <returns></returns>
        public static string IncrementCellReference(string reference, CellReferencePartEnum cellRefPart)
        {
            string newReference = reference;

            if (cellRefPart != CellReferencePartEnum.None && !String.IsNullOrEmpty(reference))
            {
                string[] parts = Regex.Split(reference, "([A-Z]+)");

                if (cellRefPart == CellReferencePartEnum.Column || cellRefPart == CellReferencePartEnum.Both)
                {
                    List<char> col = parts[1].ToCharArray().ToList();
                    bool needsIncrement = true;
                    int index = col.Count - 1;

                    do
                    {
                        // increment the last letter
                        col[index] = Letters[Letters.IndexOf(col[index]) + 1];

                        // if it is the last letter, then we need to roll it over to 'A'
                        if (col[index] == Letters[Letters.Count - 1])
                        {
                            col[index] = Letters[0];
                        }
                        else
                        {
                            needsIncrement = false;
                        }

                    } while (needsIncrement && --index >= 0);

                    // If true, then we need to add another letter to the mix. Initial value was something like "ZZ"
                    if (needsIncrement)
                    {
                        col.Add(Letters[0]);
                    }

                    parts[1] = new String(col.ToArray());
                }

                if (cellRefPart == CellReferencePartEnum.Row || cellRefPart == CellReferencePartEnum.Both)
                {
                    // Increment the row number. A reference is invalid without this componenet, so we assume it will always be present.
                    parts[2] = (int.Parse(parts[2]) + 1).ToString();
                }

                newReference = parts[1] + parts[2];
            }

            return newReference;
        }

        /// <summary>
        /// Returns a cell Object corresponding to a specifc address on the worksheet
        /// </summary>
        /// <param name="workSheetPart">WorkSheet to search for cell adress</param>
        /// <param name="cellAddress">Cell Address (ie. B2)</param>
        /// <returns>Cell Object</returns>
        public static Cell GetCell(WorksheetPart workSheetPart, string cellAddress)
        {
            return workSheetPart.Worksheet.Descendants<Cell>()
                                .Where(c => cellAddress.Equals(c.CellReference))
                                .SingleOrDefault();
        }

        /// <summary>
        /// Inserts a new cell at the specified colName and rowIndex. If a cell
        /// already exists, then the existing cell is returned.
        /// </summary>
        /// <param name="colName">Column Name</param>
        /// <param name="rowIndex">Row Index</param>
        /// <param name="worksheetPart">Worksheet Part</param>
        /// <returns>Inserted Cell</returns>
        public static Cell InsertCell(string colName, uint rowIndex, WorksheetPart worksheetPart)
        {
            return InsertCell(colName, rowIndex, worksheetPart, null);
        }

        /// <summary>
        /// Inserts a new cell at the specified colName and rowIndex. If a cell
        /// already exists, then the existing cells are shifted to the right.
        /// </summary>
        /// <param name="colName">Column Name</param>
        /// <param name="rowIndex">Row Index</param>
        /// <param name="worksheetPart">Worksheet Part</param>
        /// <param name="cell"></param>
        /// <returns>Inserted Cell</returns>
        public static Cell InsertCell(string colName, uint rowIndex, WorksheetPart worksheetPart, Cell insertCell)
        {
            Worksheet worksheet = worksheetPart.Worksheet;
            SheetData sheetData = worksheet.GetFirstChild<SheetData>();
            string insertReference = colName + rowIndex;

            // If the worksheet does not contain a row with the specified row index, insert one.
            Row row;
            if (sheetData.Elements<Row>().Where(r => r.RowIndex == rowIndex).Count() != 0)
            {
                row = sheetData.Elements<Row>().Where(r => r.RowIndex == rowIndex).First();
            }
            else
            {
                row = new Row() { RowIndex = rowIndex };
                sheetData.Append(row);
            }

            Cell retCell = row.Elements<Cell>().FirstOrDefault(c => c.CellReference.Value == colName + rowIndex);
            // If retCell is not null and we are not inserting a new cell, then just skip everything and return the cell
            if (retCell != null)
            {
                // NOTE: if conditions are not combined because we want to skip the parent 'else when the outside 'if' is true.
                // if retCell is not null and we are inserting a new cell, then move all existing cells to the right.
                if (insertCell != null)
                {
                    // Get all the cells in the row with equal or higher column values than the one being inserted. 
                    // Add the cell to be inserted into the temp list and re-index all of the cells.
                    List<Cell> cells = row.Descendants<Cell>().Where(c => String.Compare(c.CellReference.Value, insertReference) >= 0).ToList();
                    cells.Insert(0, insertCell);
                    string cellReference = insertReference;

                    foreach (Cell cell in cells)
                    {
                        // Update the references for the rows cells.
                        cell.CellReference = new StringValue(cellReference);
                        IncrementCellReference(cellReference, CellReferencePartEnum.Column);
                    }

                    // actually insert the new cell into the row
                    retCell = row.InsertBefore(insertCell, retCell);  // at this point, retCell still points to the row that had the insertReference
                }
            }
            // Else retCell is null, this means no cell exists at the specified location so we need to put a new cell in that space.  
            // If a cell was passed into this method, then it will be inserted. If not, a new one will be inserted.
            else
            {
                // Cells must be in sequential order according to CellReference. Determine where to insert the new cell.
                // Sequencial order can't be string comparison order, has to be Excel order ("A", "B", ... "AA", "BB", etc)
                Cell refCell = null;
                foreach (Cell cell in row.Elements<Cell>())
                {
                    string cellColumn = Regex.Replace(cell.CellReference.Value, @"\d", "");
                    if (colName.Length <= cellColumn.Length && string.Compare(cell.CellReference.Value, insertReference, true) > 0)
                    {
                        refCell = cell;
                        break;
                    }
                }

                // Insert cell parameter is supplied, otherwise, create a new cell
                retCell = insertCell ?? new Cell() { CellReference = insertReference };
                row.InsertBefore(retCell, refCell);
            }

            return retCell;
        }

//Other missing pieces

public enum CellReferencePartEnum
    {
        None,
        Column,
        Row,
        Both
    }

 private static List<char> Letters = new List<char>() { 'A', 'B', 'C', 'D', 'E', 'F', 'G', 'H', 'I', 'J', 'K', 'L', 'M', 'N', 'O', 'P', 'Q', 'R', 'S', 'T', 'U', 'V', 'W', 'X', 'Y', 'Z', ' ' };