CSV files are normally parsed using a TextReader (a StreamReader) or simple File IO. (There is a TextFieldParser as well, and some third-party libraries.) This entails either iterating the file line-by-line or reading the entire file into a string array (ReadAllLines). For a large file this can be slow or problematic, and does not allow random access; that is, you cannot directly read specific lines of the file. (You can do this if you ReadAllLines, but a large file may not fit in memory.)
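For reference, the conventional line-by-line approach looks something like this (a minimal sketch; the path and the Split-based parsing are just illustrative):

```vbnet
Imports System.IO

Module LineByLine
    Sub Main()
        ' Conventional approach: read one line at a time, so the whole
        ' file never has to fit in memory. No random access, though --
        ' to reach line N you must read lines 1 to N-1 first.
        Using reader As New StreamReader("C:\Users\Andrew\Documents\staff.csv")
            Dim line As String = reader.ReadLine() ' header row
            line = reader.ReadLine()
            While line IsNot Nothing
                Dim fields() As String = line.Split(","c) ' naive: no quoted-field handling
                Console.WriteLine(fields(1) & " " & fields(2))
                line = reader.ReadLine()
            End While
        End Using
    End Sub
End Module
```

Note that Split(","c) cannot cope with quoted fields that themselves contain commas; TextFieldParser or a third-party CSV library handles that case.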
Grab the whole file?
For a reasonably-sized file there is nothing wrong with iterating the lines or loading the whole file into memory. Nevertheless, it is always good to be aware of different approaches (and their advantages).
A CSV is, or should be, a database table: an ordered arrangement of rows and columns. Using a connection string and populating a DataTable has these advantages:
- The SQL statement can include criteria to reduce the number of lines retrieved, and an ORDER BY clause to sort them
- You can specify the particular columns you need, ignoring the others
- Optionally, you can specify the data-types of the columns
- A DataReader can be used to Load the DataTable, which is very fast
- LINQ can be used to work with and filter the DataTable rows
- Different DataViews of the DataTable can be created, or the DefaultView can be filtered (DataView.RowFilter)
Wherever possible, specify only the columns that you need and use a WHERE clause.
What about an ORM?
With the CSV as a data-source, creating or using an object-relational mapping (ORM) can be considered. E.g., the Entity Framework. That is, you will operate against .NET objects (POCO), rather than executing statements against the database. This is the approach to consider for a full-blown application.
However, a lot of forum questions about CSVs, particularly large CSVs, are, I suspect, from people who are either building a small application, or just need to process the CSV as a one-off or sys-admin task. In which case, using an ORM is overkill and doesn't directly solve the problem of processing such large files.
Besides, for a full-blown application, data that needs to be shared should be in a formal database, rather than passing around large CSV files.
The File
I will attach the file that I am using. Here is a sample of the data it contains:
StaffNo,FirstName,LastName,Salary,Grade,Office,Department,Extn,Bonus,BonusRate,StartDate,EndDate
101,Cole,Russell,22000,6,London,Sales,5525,FALSE,0.03,1980-07-21,NULL
104,Charlotte,Hill,31500,4,Leeds,Sales,3175,TRUE,0.04,1980-12-01,NULL
105,Diego,Warren,54000,1,London,Admin,5579,TRUE,0.05,1981-03-30,NULL
113,Chloe,Hall,22000,6,London,Sales,5518,TRUE,0.03,1981-05-05,NULL
114,Melanie,Cruz,41000,2,Birmingham,Sales,4142,FALSE,0.05,1981-07-20,NULL
116,David,Garcia,32000,4,Birmingham,Admin,4128,TRUE,0.04,1981-10-19,NULL
119,Nevaeh,Pierce,26000,5,Birmingham,Accounts,4108,TRUE,0.03,1981-11-30,NULL
121,Makayla,Bryant,28000,5,London,Sales,5540,TRUE,0.03,1982-10-04,2010-03-31
122,Savannah,Lopez,41000,2,London,IT,5550,FALSE,0.05,1983-01-04,NULL
124,Landon,Watson,20000,6,Birmingham,Admin,4139,TRUE,0.03,1983-12-05,NULL
126,Carlos,Lawrence,28000,5,London,Sales,5537,TRUE,0.03,1984-04-16,NULL
128,Grace,Watson,41000,2,Birmingham,Admin,4119,TRUE,0.05,1984-06-04,NULL
129,Brady,Rose,40500,2,Birmingham,Admin,4123,TRUE,0.05,1984-12-31,NULL
130,Anna,Rose,37000,3,Birmingham,Admin,4125,TRUE,0.04,1985-01-07,NULL
131,Trinity,Gonzalez,25500,5,Leeds,Admin,3187,TRUE,0.03,1985-04-01,NULL
135,Samantha,Stevens,20000,6,Birmingham,Admin,4140,TRUE,0.03,1985-06-17,NULL
138,Gabriel,Rivera,32000,4,Birmingham,Accounts,4105,FALSE,0.04,1985-07-22,NULL
144,Morgan,Holmes,19500,6,Leeds,Admin,3179,TRUE,0.03,1985-12-02,NULL
145,Kaylee,Reyes,28000,5,London,Sales,5536,TRUE,0.03,1986-06-23,NULL
staff.zip (2.54K)
The Connection String
[I am using a Console Application.]
Option Explicit On

Imports System.Data.OleDb

Module Module1
    Sub Main()
        Dim sLocation = "C:\Users\Andrew\Documents\"
        Dim sConn = "Provider=Microsoft.Jet.OLEDB.4.0;Data Source=" & sLocation & _
                    ";Extended Properties=""text;HDR=Yes;FMT=Delimited"";"
        Dim dt As New DataTable("Staff")
There are other providers that can be used; note in particular that the Jet provider is 32-bit only, so a 64-bit process needs the ACE provider (Microsoft.ACE.OLEDB.12.0) instead. Check connection strings at connectionstrings.com.
For this process to work the CSV must be fully and consistently organised into rows and columns. This should be the case particularly as a lot of CSVs are initially output from some database. On this point, it should also be considered whether the originating database can be referenced directly, rather than assuming that the intermediate step of using a CSV is necessary.
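Incidentally, the Jet/ACE text driver can also take its column names and types from a schema.ini file placed in the same folder as the CSV; the column types below are illustrative, not taken from the attached file's definition:

```ini
[staff.csv]
ColNameHeader=True
Format=CSVDelimited
Col1=StaffNo Long
Col2=FirstName Text
Col3=LastName Text
Col4=Salary Currency
```

This is an alternative to letting the driver infer the types by sampling the first rows of the file.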
DataTable and DataAdapter
Dim dt As New DataTable("Staff")

Using adapt As New OleDbDataAdapter("SELECT * FROM [staff.csv]", sConn)
    adapt.Fill(dt)
End Using
A key point is that the SQL statement SELECT * FROM [staff.csv] can specify individual columns, use criteria and an ORDER BY clause. It could also use GROUP BY or aggregate functions; that is, you could obtain some SUMs or AVERAGEs without having to parse or load the entire file. Some examples:
SELECT * FROM [staff.csv] WHERE Salary > 25000
SELECT FirstName, LastName, Salary FROM [staff.csv] WHERE Salary > 25000
SELECT FirstName, LastName, Office FROM [staff.csv] WHERE Office = 'London' ORDER BY LastName, FirstName
SELECT TOP 100 * FROM [staff.csv]
We aren't defining the columns of the DataTable; they will be created automatically and their names and data-types inferred. A section towards the end of this tutorial demonstrates specifying the columns.
The DataAdapter enables reading, and writing back, of the data. Although it is possible to write changes back to the file, this is not recommended unless you have exclusive access to it. If the file is in a shared environment then switching to a formal (multi-user) database is much preferable; otherwise, you may need to revert to the file-IO or StreamReader approach mentioned at the beginning.
Assuming that we only need read access to the file, I will use a (fast) DataReader in preference to a DataAdapter.
DataAdapters and DataReaders :MSDN
MSDN said:
A DataAdapter is used to retrieve data from a data source and populate tables within a DataSet. The DataAdapter also resolves changes made to the DataSet back to the data source. The DataAdapter uses the Connection object of the .NET Framework data provider to connect to a data source, and it uses Command objects to retrieve data from and resolve changes to the data source.
DataReader
MSDN said:
You can use the ADO.NET DataReader to retrieve a read-only, forward-only stream of data from a database. Results are returned as the query executes, and are stored in the network buffer on the client until you request them using the Read method of the DataReader. Using the DataReader can increase application performance both by retrieving data as soon as it is available, and (by default) storing only one row at a time in memory, reducing system overhead.
Dim sSql As String = "SELECT * FROM [staff.csv]"

Using conn As New OleDbConnection(sConn)
    Dim cmd As OleDbCommand = New OleDbCommand(sSql, conn)
    If conn.State <> ConnectionState.Open Then
        conn.Open()
    End If
    Dim reader As OleDbDataReader = cmd.ExecuteReader() 'CommandBehavior.CloseConnection
    dt.Load(reader)
    Console.WriteLine("There are {0} rows.", dt.Rows.Count)
    'There are 107 rows.
End Using
CommandBehavior.CloseConnection: this argument isn't needed because the Using statement will dispose of the connection object.
If conn.State <> ConnectionState.Open Then: I borrowed this piece of code, but I don't see it used very often. We could just call conn.Open() and handle any exception. OleDbConnection.Open Method :MSDN. (There should be some error-handling code added in either case.)
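As a sketch of that error handling (the message text is illustrative):

```vbnet
Using conn As New OleDbConnection(sConn)
    Try
        conn.Open()
    Catch ex As OleDbException
        ' Provider missing, path wrong, file locked, etc.
        Console.WriteLine("Could not open the connection: " & ex.Message)
        Return
    End Try
    ' ... execute commands here ...
End Using
```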
Examining the Data
Console.WriteLine(String.Join(",", dt.Rows(3).ItemArray()))
'113,Chloe,Hall,22000,6,London,Sales,5518,TRUE,0.03,05/05/1981 00:00:00,NULL

For x = 0 To 20
    Console.WriteLine(String.Join(",", dt.Rows(x).ItemArray()))
Next

For x As Integer = 3 To 10
    Console.WriteLine("{0} {1}", dt.Rows(x)("FirstName"), dt.Rows(x)("Office"))
Next

Console.WriteLine(dt.Columns("Salary").DataType.ToString) 'Int32
Console.WriteLine("There are {0} rows.", dt.Rows.Count)
A Single Value
If you only need a single value from the file, such as the total of a particular column, then use ExecuteScalar().
Using conn As New OleDbConnection(sConn)
    If conn.State <> ConnectionState.Open Then
        conn.Open()
    End If
    Dim cmd As OleDbCommand = New OleDbCommand()
    cmd.Connection = conn
    cmd.CommandText = "SELECT Sum(Salary) FROM [staff.csv]"
    Dim total As Decimal = cmd.ExecuteScalar()
    Console.WriteLine("Total Salary {0:C}", total)
End Using
LINQ Operators
Introduction to LINQ in Visual Basic :MSDN
Dim london1 = From peeps In dt.AsEnumerable()
              Where peeps("Office") = "London" AndAlso peeps("Salary") > 20000
              Order By peeps("LastName")

For Each dude In london1
    Console.WriteLine("{0} {1} {2:C}", dude("FirstName"), dude("LastName"), dude("Salary"))
Next
Console.WriteLine(london1.Count()) '51
'Count() is an Enumerable method

Dim london2 = From peeps In dt.AsEnumerable()
              Where peeps.Field(Of String)("Office") = "London" AndAlso
                    peeps.Field(Of DateTime)("StartDate") < #12/31/2000# 'M/dd/yyyy

For Each dude In london2
    Console.WriteLine("{0} {1} {2:C}", dude("FirstName"), dude("LastName"), dude("Salary"))
Next

Dim london3 = From peeps In dt.AsEnumerable()
              Where peeps.Field(Of String)("Office") = "London"
              Select New With {
                  .FullName = String.Format("{0} {1}", _
                      peeps.Field(Of String)("FirstName"), peeps.Field(Of String)("LastName")),
                  .Salary = peeps.Field(Of Int32)("Salary")} 'NOT Decimal (invalid cast)

For Each dude In london3
    Console.WriteLine("{0} {1:C}", dude.FullName, dude.Salary)
Next
The Field(Of Type) method is useful for stating the data-type, and for handling null values. Generic Field :MSDN. (Note that AsEnumerable(), from System.Data.DataSetExtensions, is what exposes the DataTable's rows to LINQ.)
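Field(Of T) pairs well with nullable value types. A sketch, assuming EndDate has been loaded as a DateTime column that allows nulls:

```vbnet
' Staff with a non-null EndDate, i.e. people who have left.
Dim leavers = From peeps In dt.AsEnumerable()
              Where peeps.Field(Of DateTime?)("EndDate").HasValue
              Select Name = peeps.Field(Of String)("LastName"),
                     LeftOn = peeps.Field(Of DateTime?)("EndDate").Value

For Each dude In leavers
    Console.WriteLine("{0} left {1:d}", dude.Name, dude.LeftOn)
Next
```

Field(Of DateTime?) returns Nothing for a DBNull value, so HasValue can be tested without an exception.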
I have a full LINQ tutorial sequence here, although it is in C#.
Defining the Columns
Wherever possible in .NET, data-types should be explicitly defined; it is both less error-prone and more efficient to do so.
Dim dt As New DataTable("Staff")
dt.Columns.Add("StaffNo", Type.GetType("System.Int32"))
dt.Columns.Add("FirstName", Type.GetType("System.String"))
dt.Columns.Add("LastName", Type.GetType("System.String"))
dt.Columns.Add("Salary", Type.GetType("System.Decimal"))
dt.Columns.Add("Grade", Type.GetType("System.Int16"))
dt.Columns.Add("Office", Type.GetType("System.String"))
dt.Columns.Add("Department", Type.GetType("System.String"))
dt.Columns.Add("Extn", Type.GetType("System.Int16"))
dt.Columns.Add("Bonus", Type.GetType("System.Boolean"))
dt.Columns.Add("BonusRate", Type.GetType("System.Double"))
dt.Columns.Add("StartDate", Type.GetType("System.DateTime"))
dt.Columns.Add("EndDate", Type.GetType("System.DateTime"))
dt.Columns("EndDate").AllowDBNull = True
Dim sSql = "SELECT StaffNo, FirstName, LastName, Salary, Grade, Office, Department, Extn, Bonus, "
sSql &= "BonusRate, StartDate, IIF(EndDate = 'NULL',NULL,EndDate) AS [EndDate] FROM [staff.csv]"
(All of the previous code could work with these explicit column definitions, but you'll need to account for columns that aren't filled being NULL.)
You can, and should, also specify a primary key (if there is one):
dt.PrimaryKey = New DataColumn() {dt.Columns("StaffNo")}
(Indexes are applied to a DataView, not a DataTable.)
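With the primary key in place, a row can be fetched directly by key, which is effectively the random access that plain line-by-line file reading doesn't give you:

```vbnet
' Requires dt.PrimaryKey to have been set on StaffNo.
Dim found As DataRow = dt.Rows.Find(113)
If found IsNot Nothing Then
    Console.WriteLine("{0} {1}", found("FirstName"), found("LastName"))
End If
```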
It is much better, and more efficient, to specify the columns in a SQL statement, rather than using *. However, in this case, it is essential, because the value NULL in the file is treated as a text-value 'NULL'. AllowDBNull alone is not sufficient to convert 'NULL' to NULL so the (Jet) IIF function is used to achieve this.
In many cases a missing value between commas ,, or double-quotes ,"", will be treated as NULL. This doesn't help with our file because EndDate is the last column.
I want to emphasise again that you should specify only the columns that you need, and use a WHERE clause to reduce the number of rows/lines returned.
DataViews and RowFilters
A DataTable has a DefaultView - a DataView - which you can filter using DataView.RowFilter. You can also create new DataViews of the DataTable which could, for example, be used to populate a DataGridView.
Creating a DataView :MSDN
Dim dv As DataView = New DataView(dt, "Office='London' AND Salary > 25000", _
                                  "LastName", DataViewRowState.CurrentRows)

For Each row In dv
    Console.WriteLine("{0} {1} {2} {3:C}", row("FirstName"), row("LastName"), _
                      row("Office"), row("Salary"))
Next
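Alternatively, filter and sort the table's DefaultView in place (the filter expression here is just an example):

```vbnet
dt.DefaultView.RowFilter = "Department = 'Sales'"
dt.DefaultView.Sort = "LastName ASC"
Console.WriteLine("Sales staff: {0}", dt.DefaultView.Count)
' A DataGridView bound to dt shows the filtered, sorted view automatically.
```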
You don't have to work with a single DataTable: you can populate more than one table from the same CSV file, Merge() two DataTables together, or use LINQ Union(). The main point of this tutorial is to encourage you to explore all the options before you decide to load an entire, huge CSV file into memory, or to iterate through every line of it.
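As a sketch of the Merge() idea (the two SELECT statements are illustrative):

```vbnet
Dim sales As New DataTable("Sales")
Dim accounts As New DataTable("Accounts")

Using conn As New OleDbConnection(sConn)
    conn.Open()
    Dim cmd1 As New OleDbCommand("SELECT * FROM [staff.csv] WHERE Department = 'Sales'", conn)
    sales.Load(cmd1.ExecuteReader())
    Dim cmd2 As New OleDbCommand("SELECT * FROM [staff.csv] WHERE Department = 'Accounts'", conn)
    accounts.Load(cmd2.ExecuteReader())
End Using

sales.Merge(accounts) ' sales now holds both departments' rows
```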