How to make sure specific lines in a text file are selected?

The problem is based on very specific code (not long but provided)


#1 moerl
Posted 03 May 2008 - 04:54 AM

This code serves a very specific purpose and, as you can see, is designed to act on one specific file. It could obviously be modified to take any text file, but that is not required. This link shows the contents of the input file referenced in the code: http://archive.ics.u...-operative.data

This code currently does what it should just fine. What I've realized I need to add, however, seems complicated, and I am unsure how to implement it.

The functionality I'd like to add is this: I need to make sure that the tuples that end up in the "testing.csv" file include AT LEAST one example of each possible value of the last field on a line. In other words, testing.csv must contain at least one tuple whose last-of-line value is "A", at least one whose last-of-line value is "S", and at least one whose last-of-line value is "I". All three must be represented in testing.csv.

Does anyone know how I could do that?

import random

# randomSplitter.py
# 
# Description:
# This program creates two text files called "testing.csv" and "training.csv"
# based on the file "Post-Op_Patient_Data_Set.csv" in which each line
# contains exactly one instance or example of the data.
# The program does so by randomly picking 80% of the original lines of the
# provided text file and writing those out to "training.csv". The remaining
# 20% of lines of the original file are written to the file called "testing.csv".
# 
# Author: moerl

print("### RandomSplitter ###\n")

# Get input file name and define output files
inFileName = "Post-Op_Patient_Data_Set.csv"
trainingFileName = "training.csv"
testingFileName = "testing.csv"

# Open files for reading and writing
inFile = open(inFileName, 'r')
trainingFile = open(trainingFileName, 'w')
testingFile = open(testingFileName, 'w')

numInstances = 0
 
# Process all lines in the file. For each line, randomly determine whether it should
# be moved into the N "bucket" or the 100 - N bucket (or the training file and test file, 
# respectively). I chose N = 80.
for line in inFile:
	# Generate a random floating point number in the range [0.0, 1.0)
	randNum = random.random()
	
	# Randomly assign (100 - N)% of the instances of the original file to testingFile...
	# (random.random() never returns 1.0, so no upper-bound check is needed)
	if randNum >= 0.8:
		testingFile.write(line)
	# ... and the rest (N%) to trainingFile.
	else:
		trainingFile.write(line)

# Print friendly confirmation message of a job completed;)
print("Successfully split \"" + inFileName + "\" into \n\"" + trainingFileName + "\" and \n\"" + testingFileName + "\"!")

# Close all open files to conclude file processing
inFile.close()
trainingFile.close()
testingFile.close()
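
One direction that might work, as a rough sketch only (not run against the real data file): instead of deciding line by line while reading, read all lines first, set aside one randomly chosen line for each distinct last-of-line value so that it is guaranteed to land in testing.csv, and then apply the same random 80/20 draw to the remaining lines. The helper names last_value, split_with_coverage and write_split are made up for this illustration, and the sketch assumes the value of interest is the last comma-separated field of each line. It is written for Python 3.

```python
import random


def last_value(line):
    # Assumption: the value of interest is the last comma-separated field.
    return line.strip().split(",")[-1].strip()


def split_with_coverage(lines, test_fraction=0.2, rng=random):
    """Split lines into (training, testing) so that testing holds at least
    one line for every distinct last-of-line value present in the input."""
    lines = [line for line in lines if line.strip()]  # skip blank lines

    # Group line indices by their last-of-line value.
    by_label = {}
    for index, line in enumerate(lines):
        by_label.setdefault(last_value(line), []).append(index)

    # Reserve one randomly chosen line per value for the testing set.
    reserved = set(rng.choice(indices) for indices in by_label.values())

    training, testing = [], []
    for index, line in enumerate(lines):
        # Reserved lines always go to testing; the rest are drawn at random.
        if index in reserved or rng.random() < test_fraction:
            testing.append(line)
        else:
            training.append(line)
    return training, testing


def write_split(in_name, training_name, testing_name):
    # Hypothetical wrapper that mirrors the file handling of the script above.
    with open(in_name, "r") as in_file:
        training, testing = split_with_coverage(in_file.readlines())
    with open(training_name, "w") as training_file:
        training_file.writelines(training)
    with open(testing_name, "w") as testing_file:
        testing_file.writelines(testing)
```

Calling write_split("Post-Op_Patient_Data_Set.csv", "training.csv", "testing.csv") would then take the place of the loop above. Because the reserved lines are added on top of the random draw, testing.csv would end up slightly larger than 20% on average, which matters little for three reserved lines but is worth knowing. The guarantee only covers values that actually occur in the input file.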

This post has been edited by moerl: 03 May 2008 - 06:48 AM

