Wednesday, August 14, 2019

How to read an XLSX file in .net code natively, without any 3rd party tools


Much to my surprise, there is no native way to read xlsx files in C# or PowerShell. Recently I had to write code that grabbed values from an xlsx file on SharePoint and then include the values in an email that is programmatically sent out. I couldn't use 3rd party software, so this presented a challenge.

As it turns out, an xlsx file is actually a compressed file and it can be renamed to a .zip file and then the contents (mostly xmls) extracted and mapped to each other.

The code below will download an xlsx file to a local directory or open it from an local directory if the download is not needed and then it extracts the files inside and maps the xmls. I put comments at the top of the code, you need PowerShell 5.0 or later to leverage the extraction function. It also assumes the first sheet in the spreadsheet is named worksheet1, you'll have to pull that programmatically if that is not the case or change the hard coded name. The code maps the xml that carries a list of shared strings (to save space) to the worksheet xml that has a single numerical representation of that shared string inside of it, as well as a row and column delineator. Dates are tricky as Microsoft uses the number of days from January 1st, 1900 to the value as the actual numerical representation, as well as some formatting data in the XML, but if you are like me you only care about the value itself. There is some caveat where MS used 1900 as a leap year, even though it was not, google that if you are off by a day or two.

If you have more than the standard A to Z columns in use, you'll have to write extra code for that and some of the xml symbolic representations I didn't take the time to figure out what they were for as I didn't need to. If someone does, I'd appreciate a comment here explaining what they found.

If you use this code, all of yoru values should easily be accessible via $columnListOfTuples or $rowListOfTuples, which will include a row and column value, as well as the actual cell value.

This was a PITA so I hope this helps others.


<#

    $columnListOfTuples and $rowListOfTuples have everything mapped

    Each xlsx file is a compressed group of xml files into a zip file that is renamed .xlsx
    There is an xml of shared strings that are then represented by a single numerical value to save space
    There is an xml for each sheet in the spread sheet that has numerical references to the values in the shared strings xml
    To produce a virtual/programmatic representation of the spreadsheet, the code will read the sheet, then read the numerical reference than then cross reference that to the shared strings value
    I have no mapped out all functions and xmls in the compressed file, but this is enough to get you started
    It's only configured for columns A - Z, you'll have to write code for additional columsn
    You need powershell 5.0 or later to have the native unzip feature
    It assumes your first worksheet is named worksheet1
#>

cls;

#Region Read Xlsx file

    #Region Create temporary zip file and location

    $fileDate = [DateTime]::Now.ToShortDateString();
    $fileDate = $fileDate.Replace("/", "-");
    $fileTime = [DateTime]::Now.ToShortTimeString();
    $fileTime = $fileTime.Replace(":", ".");
    $fileTime = $fileTime.Replace(" ", ".");
    $fileDateTime = $fileDate + "_" + $fileTime;

    $networkFile = "http://[xxxxx].xlsx";
    $originalFile = "c:\temp\file.xlsx"; # file must exist and location must have writable rights
    #Invoke-WebRequest -Uri $networkFile -OutFile $originalFile -UseDefaultCredentials; # this is to download a network file locally, may not be needed

    $index = $originalFile.LastIndexOf("\");
    $originalFileDirectory = $originalFile.Remove($index);
    $tempCopiedFileName = $originalFileDirectory + "\tempXlsx-" + $fileDateTime + ".zip";
    $desinationPath = $originalFileDirectory + "\xlsxDecompressed";

    #Endregion /Create temporary zip file and location

    #Region Map all of the zip/xlsx file paths

    $relsXmlFile = $desinationPath + "\_rels\.rels";
    $item1XmlRelsXmlFile = $desinationPath + "\customXml\_rels\item1.xml.rels";
    $item2XmlRelsXmlFile = $desinationPath + "\customXml\_rels\item2.xml.rels";
    $item3XmlRelsXmlFile = $desinationPath + "\customXml\_rels\item3.xml.rels";
    $item1XmlFile = $desinationPath + "\customXml\item1.xml";
    $item2XmlFile = $desinationPath + "\customXml\item2.xml";
    $item3XmlFile = $desinationPath + "\customXml\item3.xml";
    $itemProps1XmlFile = $desinationPath + "\customXml\itemProps1.xml";
    $itemProps2XmlFile = $desinationPath + "\customXml\itemProps2.xml";
    $itemProps3XmlFile = $desinationPath + "\customXml\itemProps3.xml";
    $appXmlFile = $desinationPath + "\[filename]\docProps\app.xml";
    $coreXmlFile = $desinationPath + "\[filename]\docProps\core.xml";
    $customXmlFile = $desinationPath + "\[filename]\docProps\custom.xml";
    $workbookXmlRelsXmlFile = $desinationPath + "\xl\_rels\workbook.xml.rels";
    $theme1XmlFile = $desinationPath + "\xl\theme\theme1.xml";
    $sheet1XmlRelsXmlFile = $desinationPath + "\xl\worksheets|_rels|sheet1.xml.rels";
    $workSheet1XmlFile = $desinationPath + "\xl\worksheets\sheet1.xml";
    $sharedStringsXmlFile = $desinationPath + "\xl\sharedStrings.xml";
    $stylesXmlFile = $desinationPath + "\xl\styles.xml";
    $workbookXmlFile = $desinationPath + "\xl\workbook.xml";
    $contentTypesXmlFile = $desinationPath + "\[Content_Types].xml";

    #Endregion /Map all of the zip/xlsx file paths

    Copy-Item $originalFile -Destination $tempCopiedFileName -Recurse; # copy the xlsx file as a temporary zip file
    Expand-Archive -LiteralPath $tempCopiedFileName -DestinationPath $desinationPath -Force; # unzip the file
    [xml]$fileLines = Get-Content($sharedStringsXmlFile); # read the contents of the shared strings file

    #Region Map shared strings

    $arr_SharedStrings = New-Object System.Collections.Generic.List[String];
    for ($i = 0; $i -lt $fileLines.sst.si.Count; $i++ )
    {
        $line = $fileLines.sst.si[$i];
        $tuple = [tuple]::create([int]$i, [string]("`"" + $line.InnerText + "`"") );
        $arr_SharedStrings.Add($tuple); # map the shared strings to the representational number inside of the xml files
    }

    #Endregion /Map shared strings

    [xml]$fileLines = Get-Content($workSheet1XmlFile);
    $sheetData = $fileLines.worksheet.sheetData;

    #Region Map rows
    $arr_sheet1_rows = New-Object System.Collections.ArrayList; # this will have a list/tuple of each cell that makes up a row
    for ($i = 0; $i -lt $sheetData.row.Count; $i++ )
    {
        $arr_sheet1_individualRow = New-Object System.Collections.Generic.List[String];
        $rowElements = $sheetData.row[$i].c; #grab data of all cells in that row
        foreach ($element in $rowElements)
        {
            $elementsTuple = [tuple]::create([string]$element.r, [string]$element.s, [string]$element.v)
            $arr_sheet1_individualRow.Add($elementsTuple);      
        }

        $rowTuple = [Tuple]::create([int] $i+1, $arr_sheet1_individualRow);
        $arr_sheet1_rows.Add($rowTuple) | Out-Null;
    }

    $rowListOfTuples = New-Object System.Collections.ArrayList; # has row number, then column letter, then cell value
    foreach ($rowMapped in $arr_sheet1_rows)
    {
        $cellValues = $rowMapped.Item2;
        $rowNumber = $rowMapped.Item1;
        foreach ($cellValue in $cellValues)
        {
            $cellValue = $cellValue.Remove(0, 1);
            $cellValue = $cellValue.Remove($cellValue.Length - 1, 1);
            $cellValue = $cellValue.Replace(" ", "");
            $cellValuesSplit = $cellValue.Split(",");
            $columnLetter = $cellValuesSplit[0];
            $columnLetter = $columnLetter -replace '[0-9]',''
            $cellValueOnly = $cellValuesSplit[2];

            $cellTuple = [tuple]::create([int]$rowNumber, [string]$columnLetter, [string]$cellValueOnly);
            $rowListOfTuples.Add($cellTuple) | Out-Null;
        }
    }

    #Endregion /Map shared rows

    #Region Map shared columns

    $rowCount = $arr_sheet1_rows.Count;
    $sheetDataCols = $fileLines.worksheet.cols;
    $arr_sheet1_columns = New-Object System.Collections.ArrayList; # this will have a list/tuple of each cell that makes up a column
    #$arr_sheet1_columns.Add("Place holder") | Out-Null;
    #for ($i = 0; $i -lt $sheetDataCols.col.Count; $i++ )
    for ($i = 0; $i -lt $rowCount; $i++ )
    {
        if ($arr_sheet1_columns.Count -lt $sheetDataCols.col.Count)
        {
            for ($j = 0; $j -lt $sheetDataCols.col.Count; $j++ )
            {
                switch  ($j)
                {
                    0 { $columnLetterTranslated = "A"; break; }
                    1 { $columnLetterTranslated = "B"; break; }
                    2 { $columnLetterTranslated = "C"; break; }
                    3 { $columnLetterTranslated = "D"; break; }
                    4 { $columnLetterTranslated = "E"; break; }
                    5 { $columnLetterTranslated = "F"; break; }
                    6 { $columnLetterTranslated = "G"; break; }
                    7 { $columnLetterTranslated = "H"; break; }
                    8 { $columnLetterTranslated = "I"; break; }
                    9 { $columnLetterTranslated = "J"; break; }
                    10 { $columnLetterTranslated = "K"; break; }
                    11 { $columnLetterTranslated = "L"; break; }
                    12 { $columnLetterTranslated = "M"; break; }
                    13 { $columnLetterTranslated = "N"; break; }
                    14 { $columnLetterTranslated = "O"; break; }
                    15 { $columnLetterTranslated = "P"; break; }
                    16 { $columnLetterTranslated = "Q"; break; }
                    17 { $columnLetterTranslated = "R"; break; }
                    18 { $columnLetterTranslated = "S"; break; }
                    19 { $columnLetterTranslated = "T"; break; }
                    20 { $columnLetterTranslated = "U"; break; }
                    21 { $columnLetterTranslated = "V"; break; }
                    22 { $columnLetterTranslated = "W"; break; }
                    23 { $columnLetterTranslated = "X"; break; }
                    24 { $columnLetterTranslated = "Y"; break; }
                    25 { $columnLetterTranslated = "Z"; break; }
                }
                $arr_sheet1_columns.Add($columnLetterTranslated) | Out-Null;
            }       
        }

        $rowElements = $sheetData.row[$i].c; #grab data of all cells in that row
        foreach ($element in $rowElements)
        {
            $columnLetter = $element.r -replace '[^a-zA-Z-]',''
            for ($k = 0; $k -lt $arr_sheet1_columns.Count; $k++)
            {
                $column = $arr_sheet1_columns[$k];
                if ($column.StartsWith($columnLetter))
                {
                    $columnTemp = $column.ToString() + "|" + $element.v;
                    $arr_sheet1_columns.Remove($column);
                    $arr_sheet1_columns.Insert($k, $columnTemp);
                    $stop = "here";
                }
            }
        }
    }

    $columnListOfTuples = New-Object System.Collections.ArrayList; # has column letter, than row number, then cell value
    foreach ($columnMapped in $arr_sheet1_columns)
    {
        $index = $columnMapped.IndexOf("|");
        $columnMappedLetter = $columnMapped.Remove($index);

        $columnIndividualValues = $columnMapped.Remove(0, $index);
        $columnIndividualValuesSplit = $columnIndividualValues.Split("|");

        $rowNumber = 0;
        foreach($columnIndividualValueSplit in $columnIndividualValuesSplit)
        {   
            $cellTuple = [tuple]::create([string]$columnMappedLetter, [int]$rowNumber, [string]$columnIndividualValueSplit);
            $columnListOfTuples.Add($cellTuple) | Out-Null;
            $rowNumber ++;
        }
    }

    #Endregion /Map shared columns



    # Here you have all of the data and you can parse it however you want, below is an example of that

    #Region Parse the extracted data

    $rowsForParsing = New-Object System.Collections.ArrayList;
    foreach ($arr_sheet1_row in $arr_sheet1_rows)
    {
        $rowNumber = $arr_sheet1_row.Item1;
        $cellValues = New-Object System.Collections.ArrayList; 
        $rowTemp = "";

        $indexer = 0;
        foreach ($rowValue in $arr_sheet1_row.Item2)
        {
            $valueSplit = $rowValue.Replace(" ", "");
            $valueSplit = $valueSplit.Replace("(", "");
            $valueSplit = $valueSplit.Replace(")", "");
            $valueSplit = $valueSplit.Split(",");

            $cellLocation = "";
            $cellLocation = $valueSplit[0];
            $cellSType = "";
            $cellSType = $valueSplit[1];
            $cellStringRepresentationValue = "";
            $cellStringRepresentationValue = $valueSplit[2];
            $translatedValue = "";

            if ($cellStringRepresentationValue -ne "")
            {
                switch ($cellSType)
                {
                    "1" { $translatedValue = getStringMappedValue($cellStringRepresentationValue); break;}  # represents column headers
                    "2" { $translatedValue = [datetime]::FromOADate($cellStringRepresentationValue); break;}  # number of days from January 1, 1900
                    "3" { $translatedValue = getStringMappedValue($cellStringRepresentationValue); break;}  # mapped value to strings table
                    "4" {$translatedValue = getStringMappedValue($cellStringRepresentationValue); break;}  # mapped value to strings table
                    "5" { $translatedValue = getStringMappedValue($cellStringRepresentationValue); break;}  # mapped value to strings table
                    "6" { Write-Host "#6: " $rowValue; break;}  # not sure what 6 represents yet
                    "7" { Write-Host "#7: " $rowValue; break;}  # not sure what 7 represents yet
                    "8" { Write-Host "#8: " $rowValue; break;}  # not sure what 8 represents yet
                    "9" { Write-Host "#9: " $rowValue; break;}  # not sure what 9 represents yet
                    "10" { Write-Host "#10: " $rowValue; break;}  # not sure what 10 represents yet
                    "11" { Write-Host "#11: " $rowValue; break;}  # not sure what 11 represents yet     
                }
            }

            if (($rowTemp -eq "") -and ($indexer -eq 0))
            {
                $rowTemp = "$translatedValue";
            }
            else
            {       
                $rowTemp = "$rowTemp" + "|" + "$translatedValue";
            }   

            $indexer++;
        }

        if ($rowNumber -eq 22)
        {
            $stop = "here";
        }

        $rowsForParsing.Add("$rowNumber|$rowTemp") | Out-Null;
    }

    $fileInformation = New-Object System.Collections.ArrayList;
    $topRow = $rowsForParsing[1];
    foreach ($rowParsed in $rowsForParsing)
    {
        $rowParsedSplit = $rowParsed.Split("|");    
        $date = $rowParsedSplit[1];

        $pass = $true;
        try
        {
            $dtVal = get-date $date;
        }
        catch
        {
            $pass = $false;
        }

        if ($pass -eq $true)
        {
            if ((get-date) -gt (get-date $date))
            {
                $dateApplicableRow = $rowParsed;
            }
            else
            {
                Write-Host "Top row is $topRow";
                Write-Host "This weeks row is $dateApplicableRow";
                $fileInformation.Add($topRow);
                $fileInformation.Add($dateApplicableRow);
                break;
            }
        }
    }

    #Endregion /Parse the extracted data

#EndRegion /Read Xlsx file

No comments:

Post a Comment