
Load CSV to DataTable for Random Access

#1 andrewsw

  • I'm not here to twist your niblets

Reputation: 4033
  • Posts: 14,472
  • Joined: 12-December 12

Posted 09 January 2015 - 03:56 PM

This tutorial demonstrates filling a DataTable with data from a CSV file using a DataReader and SQL statements, and covers LINQ queries and a few other related topics.

CSV files are normally parsed using a TextReader (a StreamReader) or simple File IO. (There is a TextFieldParser as well, and some third-party libraries.) This entails either iterating the file line-by-line or reading the entire file into a string array (ReadAllLines). For a large file this can be slow or problematic, and does not allow random access; that is, you cannot directly read specific lines of the file. (You can do this if you ReadAllLines, but a large file may not fit in memory.)
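
For comparison, here is a minimal sketch of the line-by-line approach with a StreamReader (the file path is an assumption, and a plain Split will not cope with quoted fields that contain commas):

Imports System.IO

Module LineByLine
    Sub Main()
        'Assumed location of the CSV; adjust to your own path
        Dim sFile = "C:\Users\Andrew\Documents\staff.csv"

        Using reader As New StreamReader(sFile)
            Dim header = reader.ReadLine()      'skip the header row

            Do While Not reader.EndOfStream
                Dim fields() As String = reader.ReadLine().Split(","c)
                'process the fields of the current line here...
                Console.WriteLine("{0} {1}", fields(1), fields(2))
            Loop
        End Using
    End Sub
End Module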

Grab the whole file?

For a reasonably-sized file there is nothing wrong with iterating the lines or loading the whole file into memory. Nevertheless, it is always good to be aware of different approaches (and their advantages).
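
And a short sketch of the whole-file approach for contrast (System.IO imported, same assumed path as above); ReadAllLines gives random access by line index, at the cost of holding the entire file in memory:

        Dim lines() As String = File.ReadAllLines("C:\Users\Andrew\Documents\staff.csv")
        Console.WriteLine("{0} lines (including the header)", lines.Length)
        Console.WriteLine(lines(4))     'a specific line, counting from zero with the header at index 0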



A CSV is, or should be, structured like a database table: an ordered arrangement of rows and columns. Using a connection string and populating a DataTable has these advantages:

  • The SQL statement can include criteria to reduce the number of lines retrieved, and an ORDER BY clause to sort them
  • You can specify the particular columns you need, ignoring the others
  • Optionally, you can specify the data-types of the columns
  • A DataReader can be used to Load the DataTable, which is very fast
  • LINQ can be used to work with and filter the DataTable rows
  • Different DataViews of the DataTable can be created, or the DefaultView can be filtered (DataView.RowFilter)

Wherever possible, specify only the columns that you need and use a WHERE clause.

What about an ORM?

With the CSV as a data source, you could consider creating or using an object-relational mapper (ORM) such as Entity Framework. That is, you would work against .NET objects (POCOs) rather than executing statements against the database. This is the approach to consider for a full-blown application.

However, a lot of forum questions about CSVs, particularly large CSVs, are, I suspect, from people who are either building a small application or just need to process the CSV as a one-off or sys-admin task, in which case an ORM is overkill and doesn't directly solve the problem of processing such large files.

Besides, for a full-blown application, data that needs to be shared should be in a formal database, rather than passing around large CSV files.

The File

I will try and attach the file that I am using. Here is a sample of the data it contains:

Quote

StaffNo,FirstName,LastName,Salary,Grade,Office,Department,Extn,Bonus,BonusRate,StartDate,EndDate
101,Cole,Russell,22000,6,London,Sales,5525,FALSE,0.03,1980-07-21,NULL
104,Charlotte,Hill,31500,4,Leeds,Sales,3175,TRUE,0.04,1980-12-01,NULL
105,Diego,Warren,54000,1,London,Admin,5579,TRUE,0.05,1981-03-30,NULL
113,Chloe,Hall,22000,6,London,Sales,5518,TRUE,0.03,1981-05-05,NULL
114,Melanie,Cruz,41000,2,Birmingham,Sales,4142,FALSE,0.05,1981-07-20,NULL
116,David,Garcia,32000,4,Birmingham,Admin,4128,TRUE,0.04,1981-10-19,NULL
119,Nevaeh,Pierce,26000,5,Birmingham,Accounts,4108,TRUE,0.03,1981-11-30,NULL
121,Makayla,Bryant,28000,5,London,Sales,5540,TRUE,0.03,1982-10-04,2010-03-31
122,Savannah,Lopez,41000,2,London,IT,5550,FALSE,0.05,1983-01-04,NULL
124,Landon,Watson,20000,6,Birmingham,Admin,4139,TRUE,0.03,1983-12-05,NULL
126,Carlos,Lawrence,28000,5,London,Sales,5537,TRUE,0.03,1984-04-16,NULL
128,Grace,Watson,41000,2,Birmingham,Admin,4119,TRUE,0.05,1984-06-04,NULL
129,Brady,Rose,40500,2,Birmingham,Admin,4123,TRUE,0.05,1984-12-31,NULL
130,Anna,Rose,37000,3,Birmingham,Admin,4125,TRUE,0.04,1985-01-07,NULL
131,Trinity,Gonzalez,25500,5,Leeds,Admin,3187,TRUE,0.03,1985-04-01,NULL
135,Samantha,Stevens,20000,6,Birmingham,Admin,4140,TRUE,0.03,1985-06-17,NULL
138,Gabriel,Rivera,32000,4,Birmingham,Accounts,4105,FALSE,0.04,1985-07-22,NULL
144,Morgan,Holmes,19500,6,Leeds,Admin,3179,TRUE,0.03,1985-12-02,NULL
145,Kaylee,Reyes,28000,5,London,Sales,5536,TRUE,0.03,1986-06-23,NULL

Attached File  staff.zip (2.54K)

The Connection String

[I am using a Console Application.]
Option Explicit On

Imports System.Data.OleDb

Module Module1

    Sub Main()
        Dim sLocation = "C:\Users\Andrew\Documents\"
        Dim sConn = "Provider=Microsoft.Jet.OLEDB.4.0;Data Source=" & sLocation & _
            ";Extended Properties=""text;HDR=Yes;FMT=Delimited"";"

        Dim dt As New DataTable("Staff")


There are other providers that can be used. Check connection strings at connectionstrings.com.
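
For example, the 32-bit Jet provider is not available to a 64-bit process; the newer ACE provider follows the same pattern. A sketch of the alternative connection string, assuming the Microsoft Access Database Engine (ACE) redistributable is installed:

        'Alternative provider, assuming the Access Database Engine (ACE) is installed
        Dim sConnAce = "Provider=Microsoft.ACE.OLEDB.12.0;Data Source=" & sLocation & _
            ";Extended Properties=""text;HDR=Yes;FMT=Delimited"";"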

For this process to work the CSV must be fully and consistently organised into rows and columns. This will usually be the case, particularly as a lot of CSVs are initially exported from some database. On that point, it is also worth considering whether the originating database can be queried directly, rather than assuming that the intermediate CSV step is necessary.

DataTable and DataAdapter
    Dim dt As New DataTable("Staff")

    Using adapt As New OleDbDataAdapter("SELECT * FROM [staff.csv]", sConn)
        adapt.Fill(dt)
    End Using


A key point is that the SQL statement SELECT * FROM [staff.csv] can specify individual columns, use criteria and an ORDER BY clause. It could also use GROUP BY or aggregate functions; that is, you could obtain some SUMs or AVERAGEs without having to parse or load the entire file. Some examples:

SELECT * FROM [staff.csv] WHERE Salary > 25000
SELECT FirstName, LastName, Salary FROM [staff.csv] WHERE Salary > 25000
SELECT FirstName, LastName, Office FROM [staff.csv] WHERE Office = 'London' ORDER BY LastName, FirstName
SELECT TOP 100 * FROM [staff.csv]
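
As an illustration of the aggregate point, a grouped query against the same file might look like this (column names as in the sample data):

SELECT Office, COUNT(*) AS StaffCount, AVG(Salary) AS AvgSalary FROM [staff.csv] GROUP BY Office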

We aren't defining the columns of the DataTable; they will be created automatically and their names and data types inferred. A section towards the end of this tutorial demonstrates specifying the columns.

The DataAdapter enables reading, and writing back, of the data. Although it is possible to write changes back to the file, this is not recommended unless you have exclusive access to it. If the file is in a shared environment then switching to a formal (multi-user) database is much preferable. Otherwise, you may need to revert to using file IO or a StreamReader as mentioned at the beginning.

Assuming that we only need read-only access to the file, I will use a (fast) DataReader in preference to a DataAdapter.

DataAdapters and DataReaders :MSDN

MSDN said:

A DataAdapter is used to retrieve data from a data source and populate tables within a DataSet. The DataAdapter also resolves changes made to the DataSet back to the data source. The DataAdapter uses the Connection object of the .NET Framework data provider to connect to a data source, and it uses Command objects to retrieve data from and resolve changes to the data source.

DataReader

MSDN said:

You can use the ADO.NET DataReader to retrieve a read-only, forward-only stream of data from a database. Results are returned as the query executes, and are stored in the network buffer on the client until you request them using the Read method of the DataReader. Using the DataReader can increase application performance both by retrieving data as soon as it is available, and (by default) storing only one row at a time in memory, reducing system overhead.

    Dim sSql As String = "SELECT * FROM [staff.csv]"

    Using conn As New OleDbConnection(sConn)
        Dim cmd As OleDbCommand = New OleDbCommand(sSql, conn)
        If conn.State <> ConnectionState.Open Then
            conn.Open()
        End If
        Dim reader As OleDbDataReader = cmd.ExecuteReader()     'Commandbehavior.CloseConnection

        dt.Load(reader)
        Console.WriteLine("There are {0} rows.", dt.Rows.Count)
        'There are 107 rows.
    End Using


CommandBehavior.CloseConnection: this argument isn't needed because the Using statement will dispose of the connection object.

If conn.State <> ConnectionState.Open Then: I borrowed this piece of code but I don't see it used very often. We could just use conn.Open() and check for an exception. OleDbConnection.Open Method :MSDN. (Some error-handling code should be added in either case, as sketched below.)
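
A minimal sketch of that error handling, wrapping the Open call (the message text is only illustrative):

        Try
            conn.Open()
        Catch ex As OleDbException
            'could not open the connection: bad path, missing provider, locked file, ...
            Console.WriteLine("Failed to open the connection: " & ex.Message)
            Return
        End Try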

Examining the Data
    Console.WriteLine(String.Join(",", dt.Rows(3).ItemArray()))
    '113,Chloe,Hall,22000,6,London,Sales,5518,TRUE,0.03,05/05/1981 00:00:00,NULL

    For x = 0 To 20
        Console.WriteLine(String.Join(",", dt.Rows(x).ItemArray()))
    Next

    For x As Integer = 3 To 10
        Console.WriteLine("{0} {1}", dt.Rows(x)("FirstName"), dt.Rows(x)("Office"))
    Next

    Console.WriteLine(dt.Columns("Salary").DataType.ToString)     'Int32

    Console.WriteLine("There are {0} rows.", dt.Rows.Count)


A Single Value

If you only need a single value from the file, such as the total of a particular column, then use ExecuteScalar().
    Using conn As New OleDbConnection(sConn)
        If conn.State <> ConnectionState.Open Then
            conn.Open()
        End If
        Dim cmd As OleDbCommand = New OleDbCommand()
        cmd.Connection = conn
        cmd.CommandText = "SELECT Sum(Salary) FROM [staff.csv]"

        Dim total As Decimal = cmd.ExecuteScalar()
        Console.WriteLine("Total Salary {0:C}", total)
    End Using


LINQ Operators

Introduction to LINQ in Visual Basic :MSDN
    Dim london1 = From peeps In dt
                  Where peeps("Office") = "London" AndAlso peeps("Salary") > 20000
                  Order By peeps("LastName")

    For Each dude In london1
        Console.WriteLine("{0} {1} {2:C}", dude("FirstName"), dude("LastName"), dude("Salary"))
    Next
    Console.WriteLine(london1.Count())      '51
    'Count() is an Enumerable method

    Dim london2 = From peeps In dt
                  Where peeps.Field(Of String)("Office") = "London" AndAlso
                  peeps.Field(Of DateTime)("StartDate") < #12/31/2000#      'M/dd/yyyy

    For Each dude In london2
        Console.WriteLine("{0} {1} {2:C}", dude("FirstName"), dude("LastName"), dude("Salary"))
    Next

    Dim london3 = From peeps In dt
                  Where peeps.Field(Of String)("Office") = "London"
                  Select New With {.FullName = String.Format("{0} {1}", _
                            peeps.Field(Of String)("FirstName"), peeps.Field(Of String)("LastName")),
                                   .Salary = peeps.Field(Of Int32)("Salary")}   'NOT Decimal (invalid cast)
    For Each dude In london3
        Console.WriteLine("{0} {1:C}", dude.FullName, dude.Salary)
    Next


The Field(Of Type) method is useful for stating the data-type, and for handling null values. Generic Field :MSDN
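
For example, a brief sketch of null handling, assuming EndDate has been loaded as a nullable DateTime column (see Defining the Columns below); Field(Of DateTime?) returns Nothing rather than throwing when the cell is DBNull:

    Dim leavers = From peeps In dt
                  Where peeps.Field(Of DateTime?)("EndDate").HasValue
                  Order By peeps.Field(Of DateTime?)("EndDate")

    For Each dude In leavers
        Console.WriteLine("{0} {1} left on {2:d}", dude("FirstName"), dude("LastName"),
                          dude.Field(Of DateTime?)("EndDate"))
    Next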

I have a full LINQ tutorial sequence here, although it is in C#.

Defining the Columns

Wherever possible in .NET, data types should be explicitly defined; it is both less error-prone and more efficient to do so.
        Dim dt As New DataTable("Staff")

        dt.Columns.Add("StaffNo", Type.GetType("System.Int32"))
        dt.Columns.Add("FirstName", Type.GetType("System.String"))
        dt.Columns.Add("LastName", Type.GetType("System.String"))
        dt.Columns.Add("Salary", Type.GetType("System.Decimal"))
        dt.Columns.Add("Grade", Type.GetType("System.Int16"))
        dt.Columns.Add("Office", Type.GetType("System.String"))
        dt.Columns.Add("Department", Type.GetType("System.String"))
        dt.Columns.Add("Extn", Type.GetType("System.Int16"))
        dt.Columns.Add("Bonus", Type.GetType("System.Boolean"))
        dt.Columns.Add("BonusRate", Type.GetType("System.Double"))
        dt.Columns.Add("StartDate", Type.GetType("System.DateTime"))
        dt.Columns.Add("EndDate", Type.GetType("System.DateTime"))
        dt.Columns("EndDate").AllowDBNull = True

        Dim sSql = "SELECT StaffNo, FirstName, LastName, Salary, Grade, Office, Department, Extn, Bonus, "
        sSql &= "BonusRate, StartDate, IIF(EndDate = 'NULL',NULL,EndDate) AS [EndDate] FROM [staff.csv]"


(All of the previous code could work with these explicit column definitions, but you'll need to account for columns that aren't filled being NULL.)

You can, and should, also specify a primary key (if there is one):
dt.PrimaryKey = New DataColumn() {dt.Columns("StaffNo")}


(Indexes are applied to a DataView, not a DataTable.)
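
With the primary key in place, DataRowCollection.Find provides the random access of the title; a brief sketch using a StaffNo from the sample data:

        Dim found As DataRow = dt.Rows.Find(113)
        If found IsNot Nothing Then
            Console.WriteLine("{0} {1}", found("FirstName"), found("LastName"))
            'Chloe Hall
        End If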

It is much better, and more efficient, to specify the columns in a SQL statement, rather than using *. However, in this case it is essential, because the value NULL in the file is treated as the text value 'NULL'. AllowDBNull alone is not sufficient to convert 'NULL' to NULL, so the (Jet) IIF function is used to achieve this.

In many cases a missing value between commas (,,) or empty double-quotes (,"",) will be treated as NULL. This doesn't help with our file because EndDate is the last column.

I want to emphasise again that you should specify only the columns that you need, and use a WHERE clause to reduce the number of rows/lines returned.

DataViews and RowFilters

A DataTable has a DefaultView - a DataView - which you can filter using DataView.RowFilter. You can also create new DataViews of the DataTable which could, for example, be used to populate a DataGridView.

Creating a DataView :MSDN
    Dim dv As DataView = New DataView(dt, "Office='London' AND Salary > 25000", _
                                      "LastName", DataViewRowState.CurrentRows)

    For Each row In dv
        Console.WriteLine("{0} {1} {2} {3:C}", row("FirstName"), row("LastName"), _
                    row("Office"), row("Salary"))
    Next
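
The DefaultView can be filtered in place, without creating a new DataView; a brief sketch applying the same filter:

    dt.DefaultView.RowFilter = "Office = 'London' AND Salary > 25000"
    dt.DefaultView.Sort = "LastName"
    Console.WriteLine("Rows in the filtered DefaultView: {0}", dt.DefaultView.Count)

    'clear the filter when finished with it
    dt.DefaultView.RowFilter = String.Empty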


You don't have to work with a single DataTable: you can populate more than one table based on the same CSV file. You can Merge() two DataTables together, or use LINQ Union(). The main point of this tutorial is to encourage you to explore all the options before you decide to load an entire, huge CSV file into memory, or to iterate through every line of the file.
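
As a rough sketch of the Merge() point (assuming the primary key has been defined for both tables, so that matching rows are merged rather than duplicated):

    Dim dtLeeds As New DataTable("Staff")
    Using adapt As New OleDbDataAdapter("SELECT * FROM [staff.csv] WHERE Office = 'Leeds'", sConn)
        adapt.Fill(dtLeeds)
    End Using

    'combine the second table into the first; rows with matching keys are merged, new rows appended
    dt.Merge(dtLeeds)
    Console.WriteLine("After merging there are {0} rows.", dt.Rows.Count)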




Replies To: Load CSV to DataTable for Random Access

#2 Cal-cium

  • New D.I.C Head

Reputation: 0
  • Posts: 2
  • Joined: 27-January 15

Posted 27 January 2015 - 11:35 AM

Thanks for the tutorial, really helpful and descriptive :)

But for some reason, with the code below, it keeps saying there are 214 rows and I'm not sure why.

 Console.WriteLine("There are {0} rows.", dt.Rows.Count)
   'There are 107 rows.



#3 andrewsw

  • I'm not here to twist your niblets

Reputation: 4033
  • Posts: 14,472
  • Joined: 12-December 12

Posted 27 January 2015 - 11:56 AM

Thank you.

Maybe you filled the DataTable twice. I was commenting out the previous code as I progressed; it was not intended that all of the code be run in one go.



#4 Cal-cium

  • New D.I.C Head

Reputation: 0
  • Posts: 2
  • Joined: 27-January 15

Posted 29 January 2015 - 02:11 AM

Oops, yeah you're right, I forgot to comment it out. Thanks