Tutorial On Parsing CSV File

#1 Crazy_Learner

Posted 07 September 2009 - 01:54 PM

Today I am going to show you how to create a parser specifically for Comma Separated Value files, or CSV files. I am going to build this parser around the rules documented in RFC 4180.

Step 1: Understanding What A Parser Is
A parser, in plain and simple terms, is a program that reads a file and splits it up into tokens in accordance with the rules of the format you're parsing. (Technically, this step can also be considered a lexer, but for this tutorial there is no separate lexer, just a parser.) A better definition would be: a computer program that divides code up into functional components.

Step 2: Creating The Code
Parsers are usually intermediate or advanced level programs, but this parser is beginner material because no real advanced functions are used. The first thing that you're going to do is create one file, which will contain ALL of the code.

using System;// Basic Import Statement
using System.Collections.Generic; // Allows Us To Use Lists
using System.IO; // For File handles



I am sure that most of you understand this portion of the code; it's just the import statements, AKA include statements, AKA using directives. Now create a namespace for this program (I will call mine CommaSeperatedValueDocument) and create your first class as seen below.

namespace CommaSeperatedValueDocument
{
		public class CommaSeperatedValue
		{

		}
}



Now that the main structure of the program is laid out, we create the constructor methods. I have only one, but you can add more later if you so desire. This constructor simply passes its string argument on to the method that does the actual parsing. In this case the string is the path to the file.

// Just the constructor; note that this actually goes inside
// the class section.
public CommaSeperatedValue(string path)
{
		// send this to the parser
		CommaSeperatedValueParser(path);
}



Now for the nitty-gritty part of the code. We are going to create a method that returns a list of the tokens, so that another program can simply access this list and have all of the data without parsing it again. The List capability comes from the System.Collections.Generic namespace. So first we are going to create this method and add a new List in which to store the data.

public List<string> CommaSeperatedValueParser(string path)
{
		List<string> parsedData = new List<string>();
}



That code creates a list of strings, which will hold your tokens. Pretty simple, right? So far you should have a namespace, a class and the constructor method completed, and this method started. It should look like this:

namespace CommaSeperatedValueDocument
{
	public class CommaSeperatedValue
	{
			public CommaSeperatedValue(string path)
			{
				CommaSeperatedValueParser(path);
			}
			public List<string> CommaSeperatedValueParser(string path)
			{
				List<string> parsedData = new List<string>();
			}
	}
}



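Now comes the parsing of the data, but first we must know the rules and syntax of the CSV file. For example, this parser treats ';' as starting a comment that runs to the end of the line. (RFC 4180 itself does not define comments, so this is an extension on top of it.) In case you didn't catch the reference from above: RFC 4180.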
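
To see why the rules matter, here is a small sketch (not part of the tutorial's parser) of what happens if you simply split a line at every comma. The class name NaiveSplitDemo, the method NaiveSplit and the sample line are made up for illustration.

```csharp
using System;

public static class NaiveSplitDemo
{
    // Splits on every comma and ignores quotes entirely.
    public static string[] NaiveSplit(string line)
    {
        return line.Split(',');
    }

    public static void Main()
    {
        // Three fields; the middle one is quoted and contains a comma.
        string line = "1,\"Smith, John\",42";
        string[] parts = NaiveSplit(line);

        // The quoted field is cut in two, so this prints 4 instead of 3.
        Console.WriteLine(parts.Length);
        foreach (string part in parts)
        {
            Console.WriteLine("[" + part + "]");
        }
    }
}
```

The quoted field comes back as two pieces, which is why the parser has to track whether it is currently inside quotes.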

There are many ways to do this. You can read the entire document and do the splitting after the read, which requires you to hold the whole string in memory, or you can use a while or foreach loop and read it line by line. Neither of these options is, in my opinion, "optimized", but for now we will read line by line. I have decided to use a foreach statement which reads every character of the line. But before we get to that, let's create a few variables.

// These are in the CommaSeperatedValueParser Method
int total_comments = 0;	 // Number of comments in CSV file
int total_blankLines = 0;   // Number Of blank lines in CSV File
bool tokenInQuotes = false;// is the token in quotes?
bool tokenContinued = true;// is the token continued?
int total_tokens = 0;	   // Number of tokens counted
string temp_println = "";   // Temp String For When The line is in quotes but breaks to new line



As the comments state, these keep track of the parser's state at the character it's reading. They will be used in our foreach loop because everything is read a character at a time. So let's start the loops (while and foreach).

Step 3: The Parser Itself:
try
{
	// READ THE FILE
	StreamReader readFile = new StreamReader(path);
	
	// READ THE LINE
	string readLine = null;
	
	// WHAT LINE TO PRINT
	string printLine = null;

	// LOOP UNTIL THERE ARE NO MORE LINES TO READ
	while ((readLine = readFile.ReadLine()) != null)
	{



The above is pretty self-explanatory: we open the file at the "path" location and create a while loop that reads the file until there are no more lines available to read. Next we do preliminary checks. These checks are to see if the line is a comment or a blank line.

	   // Ignore Any Lines Starting With ';'
	if (readLine.StartsWith(";"))
	{
		printLine = null;
		total_comments = total_comments + 1;
			// Also Written total_comments++;
	}

	// If line is not comment line check if its blank
	else if (readLine.Trim().Length == 0)
	{
		printLine = null;
		total_blankLines = total_blankLines + 1;
			//Also written total_blankLines++;
	}
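
A quick aside on the blank-line check above. This is a small sketch, assuming a hypothetical helper named IsBlank, that shows how Trim behaves: it returns a new string, it never returns null, and it leaves the original string untouched.

```csharp
using System;

public static class BlankLineDemo
{
    // Hypothetical helper: true when the line is empty or only whitespace.
    public static bool IsBlank(string line)
    {
        return line.Trim().Length == 0;
    }

    public static void Main()
    {
        string line = "   \t  ";          // 3 spaces, a tab, 2 spaces
        string trimmed = line.Trim();

        Console.WriteLine(trimmed == null);   // False: Trim never returns null
        Console.WriteLine(trimmed.Length);    // 0: nothing but whitespace was there
        Console.WriteLine(line.Length);       // 6: the original string is unchanged
        Console.WriteLine(IsBlank(line));     // True
        Console.WriteLine(IsBlank(" a,b "));  // False
    }
}
```

On .NET 4.0 and later, string.IsNullOrWhiteSpace(line) performs a similar check in one call.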



The StartsWith call is pretty self-explanatory. The Trim function returns a copy of the current line with the leading and trailing whitespace removed. It does not change the line itself, because strings are immutable, so the check only tells us whether any other characters would be left and the original spacing is still there afterwards (automatically, no code needed). So now we have determined whether the line starts with a comment or is blank. Because these lines mean nothing to a CSV document when loaded into a program, they are ignored by our parser. Now we create a foreach statement that walks through the remaining lines one character at a time.

	// Check For Any Other Characters (Default Action)
	else
	{
		// Cycle Each Character
		foreach (char character in readLine)
			{



This starts the process of reading. According to the rules we have to separate the tokens at "every" comma; however, commas inside quotes are excluded and have to remain in the actual token. So we create two if statements, one for the comma problem and one for the quote problem.

		// Split Tokens At The Commas
		if (character == ',')
		{
			if (tokenInQuotes == false)
			{
			// A comma outside quotes ends the current token
			total_tokens = total_tokens + 1;
			Console.WriteLine("  (*) " + printLine);
			parsedData.Add(printLine);
			printLine = null;
			tokenContinued = false;
			temp_println = null;
			}
			else if (tokenInQuotes == true)
			{
			// A comma inside quotes is part of the token
			printLine += character;
			tokenContinued = true;
			}
			continue;
		}

		if (character == '\"')
		{
			// Check For Start Of Quotation
			if (character == '\"' && tokenInQuotes == false)
			{
			tokenInQuotes = true;
			printLine += character;
			tokenContinued = true;
			continue;
			}

			// Check for end of Quotations
			else if (tokenInQuotes == true && character == '\"')
			{
			tokenInQuotes = false;
			printLine += character;
			tokenContinued = false;
			continue;
			}
		}



The tokenInQuotes flag records whether a quoted section has started or ended, and the quote marks themselves are added to the token. The tokenContinued flag makes sure that every character read by the foreach loop stays inside the quoted token and does not start a new token if a comma comes in between. Because each quote mark simply flips tokenInQuotes, this also allows for doubled internal quotes as defined by the rules.
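
The quote and comma handling described above can be sketched on its own as a small state machine. This is a simplified illustration, not the tutorial's exact code: the names QuoteAwareSplitDemo and SplitLine are made up, it works on a single line only, and it leaves out comments and quoted fields that span several lines.

```csharp
using System;
using System.Collections.Generic;
using System.Text;

public static class QuoteAwareSplitDemo
{
    // Splits one line at commas that are outside double quotes.
    // Quote characters are kept in the token, as in the tutorial's parser.
    public static List<string> SplitLine(string line)
    {
        List<string> tokens = new List<string>();
        StringBuilder current = new StringBuilder();
        bool inQuotes = false;

        foreach (char c in line)
        {
            if (c == '"')
            {
                inQuotes = !inQuotes;            // opening or closing quote
                current.Append(c);
            }
            else if (c == ',' && !inQuotes)
            {
                tokens.Add(current.ToString());  // comma ends the token
                current.Length = 0;              // start a new token
            }
            else
            {
                current.Append(c);               // ordinary character or quoted comma
            }
        }

        tokens.Add(current.ToString());          // last token on the line
        return tokens;
    }

    public static void Main()
    {
        foreach (string token in SplitLine("1,\"Smith, John\",42"))
        {
            Console.WriteLine("[" + token + "]");
        }
    }
}
```

With the sample line this yields three tokens, and the comma inside the quotes stays in the middle one.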

The next step is to add the characters themselves to the printed line and to check for comments in the middle of the line.
		// Check For Internal Comments
		if (character == ';')
		{
			total_comments = total_comments + 1;
			temp_println = printLine;
			printLine = null;
			printLine = temp_println;
			break;
		}

		// Handle all other characters
		if (character != ';' && character != '\"' && character != ',')
		{
			printLine += character;
			continue;
		}
		 }



For the comment case, the token collected up to the point where the comment starts is kept and the rest of the line is discarded. The break is used at the comment because everything after the ';' is a comment, so we need to leave the foreach loop and move on to the next line. All other characters still need to be read (unless they are in a comment), so there we use continue, which simply moves on to the next character.
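
If the difference between break and continue is new to you, here is a tiny sketch of it in isolation. The names CommentStripDemo and BeforeComment are made up, and skipping spaces is only there to give continue something to do; the tutorial's parser does not skip spaces.

```csharp
using System;
using System.Text;

public static class CommentStripDemo
{
    // Returns the part of the line before the first ';', with spaces skipped.
    public static string BeforeComment(string line)
    {
        StringBuilder kept = new StringBuilder();
        foreach (char c in line)
        {
            if (c == ';')
            {
                break;      // the rest of the line is a comment: leave the loop
            }
            if (c == ' ')
            {
                continue;   // skip this character and go on to the next one
            }
            kept.Append(c);
        }
        return kept.ToString();
    }

    public static void Main()
    {
        Console.WriteLine(BeforeComment("a, b ; note"));  // prints a,b
    }
}
```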

Now we have to allow line returns inside quotes to count as part of a single token. To do this we go back to the top of the foreach loop and add:
		
		// right under the foreach declaration
		if (tokenContinued == true)
		{
			temp_println = printLine;
			printLine = temp_println;
		}



The tokenContinued check looks at whether we are in quotes, like above, but it is evaluated on every new line to see if a quote was still open. Now close the foreach statement and add this code:

		// Print tokens at the end of the line
		if (tokenContinued == false)
		{
		total_tokens = total_tokens + 1;
		Console.WriteLine("  (*) " + printLine);
		parsedData.Add(printLine);
		printLine = null;
		temp_println = null;
		}



This prevents a line break from ending a token that is still inside quotes. Now close the else and while statements and create a reporter:
					Console.WriteLine("");
					Console.WriteLine("File Stats: ");
					Console.WriteLine("  (*) File Contains " + total_comments + " Comments");
					Console.WriteLine("  (*) File Contains " + total_blankLines + " Blank Lines");
					Console.WriteLine("  (*) File Contains " + total_tokens + " tokens");
					Console.ReadLine();



Then close the try statement and create the exception clause:

catch (Exception e)
{
	// Report the problem instead of silently ignoring it
	Console.WriteLine("Could not parse file: " + e.Message);
}
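
As a side note on file handling, here is one possible way to make sure the file is closed and that a missing file is reported. It is only a sketch under assumptions: the names SafeReadDemo and ReadLines are made up, and it reads raw lines instead of tokens to stay short. It wraps the StreamReader in a using block and catches IOException, which also covers FileNotFoundException.

```csharp
using System;
using System.Collections.Generic;
using System.IO;

public static class SafeReadDemo
{
    // Hypothetical helper: reads all lines, or returns an empty list
    // when the file cannot be read.
    public static List<string> ReadLines(string path)
    {
        List<string> lines = new List<string>();
        try
        {
            // The using block closes the file even if an exception is thrown.
            using (StreamReader reader = new StreamReader(path))
            {
                string line;
                while ((line = reader.ReadLine()) != null)
                {
                    lines.Add(line);
                }
            }
        }
        catch (IOException e)
        {
            Console.WriteLine("Could not read '" + path + "': " + e.Message);
        }
        return lines;
    }

    public static void Main()
    {
        // Create a small temporary file so the example is self-contained.
        string path = Path.GetTempFileName();
        File.WriteAllText(path, "a,b\nc,d\n");
        Console.WriteLine(ReadLines(path).Count);  // 2 lines read

        File.Delete(path);
        Console.WriteLine(ReadLines(path).Count);  // error message, then 0
    }
}
```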



Now the final step is to return the parsed data and close the method:
return parsedData;



------------------------------------------------------Under This Line Is the entire source code ---------------------------------------------------------
using System;
using System.Collections.Generic;
using System.IO;

namespace CommaSeperatedValueDocument
{
public class CommaSeperatedValue
{
	public CommaSeperatedValue()
	{

	}
	public CommaSeperatedValue(string path)
	{
	CommaSeperatedValueParser(path);
	}
	public List<string> CommaSeperatedValueParser(string path)
	{
	List<string> parsedData = new List<string>();
	uint total_comments = 0;
	uint total_blankLines = 0;
	bool tokenInQuotes = false;
	bool tokenContinued = true;
	uint total_tokens = 0;
	string temp_println = "";

	try
	{
		StreamReader readFile = new StreamReader(path);
		string readLine = null;
		string printLine = null;

		while ((readLine = readFile.ReadLine()) != null)
		{
		// Ignore Any Lines Starting With ';'
		if (readLine.StartsWith(";"))
		{
			printLine = null;
			total_comments = total_comments + 1;
		}

		// If line is not comment line check if its blank
		else if (readLine.Trim().Length == 0)
		{
			printLine = null;
			total_blankLines = total_blankLines + 1;
		}

		// Check For Any Other Characters (Default Action)
		else
		{
			// Cycle Each Character
			foreach (char character in readLine)
			{
			if (tokenContinued == true)
			{
				temp_println = printLine;
				printLine = temp_println;
			}
			// Split Tokens At The Commas
			if (character == ',')
			{
				if (tokenInQuotes == false)
				{
				// A comma outside quotes ends the current token
				total_tokens = total_tokens + 1;
				Console.WriteLine("  (*) " + printLine);
				parsedData.Add(printLine);
				printLine = null;
				tokenContinued = false;
				temp_println = null;
				}
				else if (tokenInQuotes == true)
				{
				// A comma inside quotes is part of the token
				printLine += character;
				tokenContinued = true;
				}
				continue;
			}

			if (character == '\"')
			{
				// Check For Start Of Quotation
				if (character == '\"' && tokenInQuotes == false)
				{
				tokenInQuotes = true;
				printLine += character;
				tokenContinued = true;
				continue;
				}

				// Check for end of Quotations
				else if (tokenInQuotes == true && character == '\"')
				{
				tokenInQuotes = false;
				printLine += character;
				tokenContinued = false;
				continue;
				}
			}

			// Check For Internal Comments
			if (character == ';')
			{
				total_comments = total_comments + 1;
				temp_println = printLine;
				printLine = null;
				printLine = temp_println;
				break;
			}

			// Handle all other characters
			if (character != ';' && character != '\"' && character != ',')
			{
				printLine += character;
				continue;
			}
			}
			// Print tokens at the end of the line
			if (tokenContinued == false)
			{
			total_tokens = total_tokens + 1;
			Console.WriteLine("  (*) " + printLine);
			parsedData.Add(printLine);
			printLine = null;
			temp_println = null;
			}
		}
		}

		Console.WriteLine("");
		Console.WriteLine("File Stats: ");
		Console.WriteLine("  (*) File Contains " + total_comments + " Comments");
		Console.WriteLine("  (*) File Contains " + total_blankLines + " Blank Lines");
		Console.WriteLine("  (*) File Contains " + total_tokens + " tokens");
		Console.ReadLine();
	}

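	catch (Exception e)
	{
		// Report the problem instead of silently ignoring it
		Console.WriteLine("Could not parse file: " + e.Message);
	}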
	return parsedData;
	}
  }
}



If you want to check against a file, you can copy the ones at:
This Code
The Main File To test With
