SSIS – Skipping rows and stripping subtotals in Excel

Saturday, October 27, 2007

I was working on an SSIS project to dump data from an Excel spreadsheet into a database table.  However, the Excel file happened to be a beautifully formatted, colorful, presentation of charts and graphs, subtotal rows and all sorts of interesting images.  The actual data I needed to dump didn’t even start until about 15 rows down and 6 rows across.  My mind started wandering on some custom component that would skip rows, strip subtotal columns and basically rewrite the entire spreadsheet.  But SSIS had better plans.


Thank goodness for the OpenRowset() property on the Excel Datasource component. This property allows you to specify the range to be considered by the datasource in the format Sheet1$B15:Z2000.  By specifying a range, I was able to ignore all the titles and data spread all over the sheet and concentrate just on my data range.  I thought skipping rows and ignoring headers / titles was going to add considerable hours to this project. I was wrong.

The next challenge was getting rid of the subtotal rows.  Subtotal rows were distinguished by a cell with the word “RESULT” and an empty cell next to it.  I added a Conditional Split using the ISNULL([expression]) to separate “real data rows” from “subtotal rows”.  After this, the import was straightforward into a OLEDB Destination.  

28 Comments

  1. Jonas says:

    Brilliant, I had the same problem with skipping over a few rows in the beginning. This helped a lot.

  2. Half Abude says:

    Great tip! It would have helped me a lot if I had found it by the time I was dealing with the same scenario… I’ve spend a lot of time learning how to read Excel files via Script Component, and I thinks that setting OpenRowSet property would be enough for almost all files. Well, at least it worth the learning.
    BTW, I’ve mentioned your post in my blog (it’s in Portuguese).

  3. http:// says:

    In the script task ReadOnlyVariables excelfile
    Readwritevariables-excel tables

    Imports System
    Imports System.Data
    Imports System.Math
    Imports Microsoft.SqlServer.Dts.Runtime
    Imports System.IO
    Imports System.Xml
    Imports ADODB
    Imports System.Data.OleDb
    Imports System.Data.SqlClient
    Public Class ScriptMain
    Public Sub Main()
    Dim excelFile As String
    Dim connectionString1 As String
    Dim connectionString2 As String
    Dim excelConnection As OleDbConnection
    Dim dbconnection As SqlConnection
    Dim tablesInFile As DataTable
    Dim tablesindb As DataTable
    Dim tableExcel As DataTable
    Dim tableCount As Integer = 0
    Dim tableDBcount As Integer = 0
    Dim tableInFile As DataRow
    Dim tableInDB As DataRow
    Dim currentTable As String
    Dim currentDBTable As String
    Dim tableIndex As Integer = 0
    Dim excelRecordSet As ADODB.Recordset
    Dim excelDataReader As OleDbDataReader
    Dim intSuccess As Integer
    Dim strTableName As DataTable

    Dim excelTables As String()

    excelFile = Dts.Variables(“excelfile”).Value.ToString

    connectionString1 = “Provider=Microsoft.Jet.OLEDB.4.0;” & _
    “Data Source=” & excelFile & _
    “;Extended Properties=Excel 8.0″
    excelConnection = New OleDbConnection(connectionString1)
    excelConnection.Open()

    connectionString2 = “Data Source=MEDLINK64BIT;Initial Catalog=RCMS_VALIDATION; Integrated Security=True;”
    dbconnection = New SqlConnection(connectionString2)
    dbconnection.Open()

    tablesInFile = excelConnection.GetSchema(“Tables”)
    tableCount = tablesInFile.Rows.Count
    ReDim excelTables(tableCount – 1)
    For Each tableInFile In tablesInFile.Rows
    currentTable = tableInFile.Item(“TABLE_NAME”).ToString
    excelTables(tableIndex) = currentTable
    tableIndex += 1
    Next
    Dts.Variables(“exceltables”).Value = excelTables
    Dim i As Integer
    For i = 0 To excelTables.Length – 1
    If excelTables(i).Substring(0, 6) = “SOURCE” And excelTables(i).Substring(excelTables(i).Length – 1, 1) = “$” Then
    Dim command As New OleDbCommand(“select * from [" & excelTables(i).ToString() & "A:AL]“, excelConnection)
    Dim rdr As OleDbDataReader = command.ExecuteReader
    Dim BulkCopy As New SqlBulkCopy(dbconnection)
    BulkCopy.DestinationTableName = “Op_PaymentMatch_SOURCE”
    BulkCopy.WriteToServer(rdr)
    End If
    Next
    Dts.TaskResult = Dts.Results.Success

    End Sub
    Private Function ReadVariable(ByVal varName As String) As Object

    Dim result As Object

    Try
    Dim vars As Variables
    Dts.VariableDispenser.LockForRead(varName)
    Dts.VariableDispenser.GetVariables(vars)

    Try
    result = vars(varName).Value
    Catch ex As Exception
    Throw ex
    Finally
    vars.Unlock()
    End Try

    Catch ex As Exception
    Throw ex
    End Try

    Return result

    End Function
    Private Sub WriteVariable(ByVal varName As String, ByVal varValue As Object)
    Try
    Dim vars As Variables
    Dts.VariableDispenser.LockForWrite(varName)
    Dts.VariableDispenser.GetVariables(vars)
    Try
    vars(varName).Value = varValue
    Catch ex As Exception
    Throw ex
    Finally
    vars.Unlock()
    End Try
    Catch ex As Exception
    Throw ex
    End Try
    End Sub

    End Class

  4. http:// says:

    I have a similar problem. I have a large text dataset of 220,000 lines and I want to import every 10th line into excel. How is this done?

  5. MikeBosch says:

    I would imagine you want to import the entire dataset first. Then, in a loop, keep a counter variable and everytime you go to the next line, increase that counter. When the counter reaches 10, import the line, then reset the counter to 0 and continue the loop.

  6. http:// says:

    Are you using Studio 2005 or 8? 5 dont’t take ‘sheet1@A9:Q21′

  7. MikeBosch says:

    This was using VS 2005. I haven’t tested it in VS 2008.

  8. http:// says:

    What about if the sheet name has a space in it (e.g. [New Sheet1]$. Ive tried every combination of brackets ([]) and single-quotes(”) that I can think of and SSIS cannot recognize the table (worksheet) with the range with a space in the name.

  9. http:// says:

    Thanks, this helped a lot

  10. http:// says:

    I have the same quaestion as Possum Jones. My sheet has a space in the name.

  11. http:// says:

    Set the data access mode to ‘SQL Command’

    and try something like this:

    SELECT F1,F2,F3,F4,F5,F6,F7,F8,F9,F10,F11,F12,F13,F14,F15,F16,F17
    FROM
    [Sheet Name$A2:V2000]

  12. http:// says:

    Thanks for this information.This really help me out.But I am facing problem when I export data to excel using Data Conversion + Excel Destination.It add single quote at start of every cell data .Also enter key comes as charachter of data.
    Please help me how to resolve it…

  13. http:// says:

    Hi,

    This sounds great. But can anyone tell me how to use Openrowset. Should I change the row numbers in Properties window or is there any other way. When I change the row numbers in properties window I am getting an 0x80040E37 error.

    Please help me to solve this.

    Thank you.

  14. http:// says:

    it worked, earlier i have tried many other ways but unsuccessful, this worked

  15. http:// says:

    Is there a way to just skip the first five lines. I’m using the “Sheet1$B15:Z2000″ which works great. The problem I am having is I know where the data starts and how many columns but the number of rows I’ll be pulling is unknown. Is there a way to just specify only the first cell “Sheet1$B15:*”

  16. http:// says:

    I tried using a large number (4000) that would probably cover the upper bound of possible rows. And then used the table(0).row.count function to find the number of actual data rows returned but the count number didn’t always have the correct number of data rows. If I had 5 data rows the count would be 5. If I had 7 data rows the count would be 95

  17. http:// says:

    Thanks a lot….. this solved my prblm

  18. http:// says:

    Thanks – this worked out perfectly for what I was needing (skipping first 4 rows – and ignoring column A)

  19. http:// says:

    Great, thank ‘s a lot for this magic property

  20. http:// says:

    To answer a couple of the questions posted:

    1) To change the OpenRowSet property in SSIS, right click the OLEDB source and select “Show Advanced Editor”. Go to Component Properties tab. It’s the last property. Set it like this: MySheet$A1:X10

    2) To use an entire column, and not hve to count the records, use this syntax: MySheet$A5:X
    This example starts on row 5 and process all records in columns A thru X.

    Murray

  21. http:// says:

    How to skip TOP 2 OR 3 rows in Excel using c#

  22. http:// says:

    Murray,
    That is simply awsome! You saved my day.

  23. http:// says:

    This was an awesome site.. It really saved time versus some of the other things that people were suggesting to do…

  24. http:// says:

    Thanks a lot.

    This works fine with SSIS 2008 as well. I have just implemented the same on SQL 2008 and it works just fine!

  25. http:// says:

    Great tip Mike! I used this trick in an SSIS package to populate an excel template.

    Works like a charm!

    cheers,
    anupam

  26. http:// says:

    Just what I needed!

  27. David says:

    This was a very helpful post. Here are a couple of details I found that might help others. I’m using SQL Server 2008 and an Excel 2007 source document.

    - For those who are initially lost (like I was), OpenRowset is a property of the Excel Source data flow component. This means you can have several Excel Source components using the same Excel Connection Manager, to get multiple data sets (multiple sheets, for example) from the same Excel document.

    - You need to change OpenRowset in the Properties window for the Excel Source, not the pop-up Excel Source Editor. Once set, don’t touch the field in the Excel Source Editor, because it doesn’t understand the field range syntax and will revert to just a sheet name.

    - If your sheet name has spaces, you don’t need any quotes. Just include the space and it will work. For example: Sheet One$B5:G100

    - To load all the rows, I just used the maximum row count possible in Excel 2007, like this: Sheet1$B7:G1048576
    The Excel Source is smart enough to only load non-blank rows. However, it is not smart enough to distinguish between rows that contain data versus rows that contain only formatting (grid lines, for example). I use a Conditional Split to filter out rows where a required field is blank, and only process the non-blank rows.

    - If you have 64-bit SQL Server, while debugging in BIDS, be sure to go into the project properties and under Debugging, set Run64BitRuntime to False. You will need to run your package in 32-bit mode in your production environment as well. Google for more info on Excel and 64-bit issues.

  28. Nathan A. says:

    I tried to skip the first row of the spreadsheet by using this syntax:

    sheet1$A2:U

    The spreadsheet has just over 20,000 rows, but the package runs (no errors) and only imports 254 records. It stops at row 255 of the spreadsheet.

    Any idea what is happening to cause this?

Leave a Reply