12 Replies - 4097 Views - Last Post: 27 July 2012 - 08:00 AM
Create Table from CSV
Posted 25 July 2012 - 08:25 AM
Replies To: Create Table from CSV
Re: Create Table from CSV
Posted 25 July 2012 - 09:54 AM
Re: Create Table from CSV
Posted 25 July 2012 - 12:48 PM
I'm not sure what your ultimate goal is here other than maybe to import any sort of data without creating the table ahead of time.
The Import Wizard will build a table for you (albiet with it's own "opinions" of what data type is most "appropriate" - you'll get a lot of NVARCHAR here if you're not careful). Getting it to do an analysis run to suggest types helps, but I fight the thing on a regular basis. Sometimes you just want a table in the database and really don't care if the data types are appropriate or not as long as the data is all there and in SQL Server where you can then use T-SQL to straighten it out.
But after that you pretty much need to have actually "designed" tables. The only reason I would use a "take it as it is" table is to get the data into SQL Server where I can use T-SQL on it. I load it into a staging schema, partially to keep it isolated from "real" data. And I immediately parse it out to properly normalized and designed tables.
If you evaluate each data element in each column you can clasify it from least restrictive to most restrictive and also evaluate the maximum length needed. Your evaluation process could get pretty complicated if you go into depth with it. For example, you might evaluate the data and allow for numerous different datetime formats. Or, what's the most appropriate precision for decimal format once you've evaluated that no data in the column can not be stored as a real number.
Personally, if I were going to write such an import procedure in C#/CLR (rather than just using the built in Import Wizard), I would just have it import everything in as NVARCHAR (maybe even Max rather than a specific size). That's because I would just be using it as a temporary stop towards parsing it out using T-SQL, so that I could manually get it into well designed tables. I would drop the table as soon as I had it's data moved off to more appropriate tables. (In fact, as I think about this more, I would probably have it create this table in TempDB to not only assure it's temporary status, but also to avoid the load of NVARCHAR(MAX) on the production disk drives. "Hopefully", if you are on a production server, TempDB is on it's own set of dedicated hard-drives, isolated from the database, log, and other drives. This configuration is VERY useful for improving index rebuilding performance and a whole host of other things. But sadly noone ever configures it this way, but instead just throws everything on to one RAID array or SAN with only minimum thought to over all performance).
Incidently, if you haven't used the Insert Wizard before, I believe you will find it by right clicking on the database name in Management Studio (2005 & 2008). Look under "Tasks". And I believe it's then under "Import Data".
If the Import Wizard doesn't do what you need, maybe we should talk about why you're doing this in the first place. You definately want a lot of thought to go into all permenant tables in a professional database. Other than on a temporary basis, it would be bad to allow just anyone to come in and upload whatever data they want without any thought to such things as normalization and such.
Anway, hopefully that helps.
This post has been edited by BBeck: 25 July 2012 - 01:18 PM
Re: Create Table from CSV
Posted 26 July 2012 - 09:20 AM
Thanks everyone for the great replies!
Re: Create Table from CSV
Posted 26 July 2012 - 10:04 AM
Getting started, the first thing to learn (and learn well is backups). If you've got good backups "most" other mistakes are "forgivable", because you can get back what you need from the backups even if it means restoring the backup to another database or another server and moving over one record at a time to fix the problem.
Understanding how the transaction log works is a big part of the backup strategy. And you need to know how both the Full Recovery Mode and the Simple Recovery Mode work, and why you would use one over the other.
That stuff is pretty critical. The choice between full and simple recovery is "less" critical than the backups themselves except that if you don't understand them you may not be doing good backups in the first place. They are tied pretty close to the backups.
After that - if you're building databases by setting up tables and doing indexes, the next most important thing to understand is database normalization. You will have bad design if you don't understand that subject pretty well.
SQL Server is very forgiving of bad design when you have a small amount of data (such as a database 10GB in size or smaller). But if the database grows significantly larger than that, those small design mistakes will bring the database to it's knees. The more records you have, the more it will push well beyond 10GB and the more severe small design mistakes will become.
Learning normalization is the first step in learning correct database design.
After that, maybe indexing is a good thing to learn and know the differance between clustered and non-clustered indexes. You should already know foreign key/primary key relationships once you learn normalization.
Maybe after that blocking and how to resolve a blocking chain would be really good to learn. If you start having a lot of data with a lot of users, you may get into blocking problems pretty quickly. Knowing how to resolve a blocking chain is something every DBA is expected to know.
Anyway, that's a bit off topic for this thread, but I know it can be hard to know what's important to learn and what's not so important to learn when you're first starting out.
Re: Create Table from CSV
Posted 26 July 2012 - 10:46 AM
This seems like a trivial matter at first. I mean does it really matter whether you use an INT or a SMALLINT? It's only 2 bytes difference! How can two bytes possibly matter? Well, it doesn't when you have 100 rows in the table. But it makes a huge difference when you have a lot of rows. That 2 bytes of wasted space is 2 bytes in every row. So, if you have a million rows (a medium sized table for a lot of businesses) that's a Megabyte of wasted space. So, a lot of people - at this point - think "Big deal! I wasted a megabyte of disk space. A one terabyte hard drive costs practically nothing these days."
But you have to understand that it isn't just disk space we're wasting here. It's also wasted time pulling in that table from disk which is already a serious bottle neck without wasting any time. But it doesn't stop there. To use the data it has to be pulled into memory and on a table scan that's 2MB of memory wasted. Memory is in not in TB but GB, so this is already looking more significant. Not only is it wasting more time reading 2MB of empty space in memory, but it's also taking up memory space that could be used for something else to make something else run at a reasonable speed. But the wasted space took up the memory it would of been using and now that has to be loaded from disk when needed as well, further slowing things down. If the data is sent across the network (which at least some subset almost always will) it's wasting some network bandwidth as well, not only slowing itself down but possibly slowing down other network communication.
And any column used in an index will get stored in the index. So, we've wasted 2MB of space in the index as well, reducing the performance of the index, andcausing an additional loss of disk space due to the increased index size for every index that uses this column, and likewise causing additional memory to be consumed because of the wasted space. With 4 indexes we've stored the column 4 times and increased the magnitude of the problem by at least 4.
And that's just a million rows. If the database grows to a billion rows (quite common at large companies), we've wasted 2 giga-bytes of memory on this one column before you even consider whether it's an indexed column. A loss of 2GB of memory is a serious issue.
And remember, this is just one datatype in one column in one table. If you chose the wrong datatypes in most of your columns in most of your tables, you will just absolutely kill the performance of your database. And indexing isn't a solution because the problem may actually be in the indexes themselves, as I mentioned. Indexing will help, but it's not going to save the database for long.
Now, you're probably not going to see problems with this until you have over 10GB of data in the database. If you have 10,000 rows in each table, chances are pretty good that you will never see a problem with this as long as the database doesn't grow significantly.
But I tell you this to get you thinking about it. It will make a difference in a large database. So, learning about it and making it just a part of your normal design procedures will make you a much better DBA when you get to working with large amounts of data. If it's habit by the time you get there, you'll be much more prepared to deal with those large database.
This post has been edited by BBeck: 26 July 2012 - 10:53 AM
Re: Create Table from CSV
Posted 26 July 2012 - 10:56 AM
Right now I am in computer engineering and have taken many classes on optimization so I am always on the look out for that. One of the first things we had to learn, "What if we are creating something on a phone where space IS an issue". I got a good understanding of PK and FK and now plan on looking into chaining!
Re: Create Table from CSV
Posted 26 July 2012 - 01:14 PM
Sounds like you are off to a great start of a DBA career!
I sent you a private message with an article that I wrote. You might find some interesting info in there, although you may know a lot of it already.
Saying your having problems with the transaction log filling up concerns me. My first reaction is to assume that it isn't being managed properly. That's a problem I've seen quite often. I've even seen "supposedly" mid to senior level DBAs setting this up wrong.
The first thing to understand is what the Transaction Log is and isn't. Microsoft probably made a mistake by calling it the Transaction "Log". People naturally think it's about as important as an error log. Nothing could be further from the truth. You could almost say that the Transaction Log is the database. And you just simply can't get more important than that.
The WRONG solution to this probably is to keep truncating the transaction log and shrinking the .LDF file. In Books On Line it says:
None. The transaction log is automatically truncated when the database is using the simple recovery model. If you must remove the log backup chain from a database, switch to the simple recovery model.
(I swear BoL has a stronger admonition against it than that somewhere, but I can't remember where off the top of my head.)
If I had to choose between permenantly losing the database or permenantly losing the transaction log, I would say without hesitation "Give me the Transaction Log". It's that important.
Basically, it stores everything that happens in the database. When a transaction commits to the database, that doesn't mean it's in the .MDF file. That means it's in the .LDF file (transaction log). If you lose the .MDF file, it's not that big of a deal if you have good backups. But if you lose the .LDF file, there's a very real chance that you've permenantly lost some data. For some companies losing 1 or 2 hundred records is not that big of a deal. For others, it means the business closes the doors permenantly (think banks or patient records at a hospital).
Anyway, if the transaction log is filling up, there are a handful of problems possibly at work.
Ideally, a DBA monitors the internal sizes of the .MDF(database) and .LDF(transaction log) files on a daily basis. the external size of these files is their size on disk. But 100% of that space is allocated on disk when they are created even if they are empty. The growth inside the file (which is harder to see) is what's important. Ideally, the DBA should be watching this growth on a daily basis and have a good idea of what the trend looks like. The DBA should be deciding once or twice a year how big these files should be.
Really, these files should have their size set when the database is built and then be left alone from then on. In a perfect world, they would have their size set to the maximum that the server can support because they would be on a dedicated server. In the real world, databases often have to share disk space with other databases and such and have to "play nice" by sharing the space. In that scenario, the DBA should be resizing them a couple times a year (and no more) just to keep them from filling up based on the growth he's seen in them over the past 6 months or so.
The reason that you want to set them once and never grow them is that they are files, and like all files they can fragment. A lot of DBAs don't even know what file fragmentation is, let alone have the ability to fix it. When you start talking to them about fragmentation they start thinking about index fragmentation and such which has nothing to do with this. If you fragment the .MDF and .LDF file on disk you can defrag the hard-drive but it will only work while the database is shutdown completly. It's generally better just not to constantly change the files sizes. And really, there's rarely a good reason to change them in a growth direction, and almost never a good reason to shrink them.
Usually, what happens is that DBAs don't know the difference between the Full Recovery Model and the Simple Recovery Model. Not knowing the difference, the setup the backups completly wrong, risking the inability to restore the database properly. Companies often don't do "fire drills" (where the DBA team restores a backup to a test server every month to test the validity of the backups and their own skills at doing a restore), and so no one actually knows if the backups are any good anyway.
So, not knowing the difference, they leave the database in Full Recovery Mode (which is what a database will default to because it's the "safest"). The Full Recovery Model keeps every transaction that occurs and allows "Point in Time Recovery". This is necesarry if you can't afford to lose a single record. It's the wrong solution if you can afford to lose records. In order to maintain this Full Recovery strategy, SQL Server never throws a transaction away.
The way it works is that the full backup begins and it marks the transaction in the log where the backup was started. The backup goes into the full backup file. All transactions are commited that can be and the backup marks the last transaction that was committed. The transactions during the full backup are then copied into the backup to make the backup good right up to that final transaction. The transaction log does not have any of the transactions removed from the transaction log, so it's internal size does not decrease and it continues to fill up.
So, in the Full Recovery model you want to be able to restore the full backup file and then use the .LDF file to restore transaction that were not saved in the backup. They are in the .LDF file and the backup knows the first spot in the transaction log where it needs to begin re-applying those transaction. That can restore things right up to the very second of failure. Or, you can use it to restore a database to a certain point in time, such as 10 minutes before the DBA accidently dropped the most important table in the database. This is very useful and why databases are created in Full Recovery by default.
The problem is that that .LDF transaction log file will just keep growing and growing forever and ever until the hard-drive fills up and the server crashes. The reason for this is full backups are not supposed to remove anything from the transaction log.
So, the correct way to resolve this issue is to either rethink your recovery strategy, or run Transaction Log backups. The Transaction Log Backups work by moving all committed transations out of the Transaction Log and into the Transaction Log Backup file. This shrinks the data inside the .LDF file because it's been moved into the Transaction Log Backup file where it's safe and sound and ready for your recovery. The only thing left in the Transaction Log (.LDF) at that point is uncommitted transactions, in other words transactions currently in process.
If you run several Transaction Log backups in between your Full backups the internal size should never grow large enough to cause problems. If the Transaction Log still fills up after this, you either have something like a Bulk Load going on that is sending an enormous number of transaction into the database on a regular basis, or the size of the .LDF file is just way too small and needs to be increased until it's too big to fill up even if that means buying a new server. But generally, you should find that your Transaction Log's internal data size is very small if you do Transaction Log backups on a regular basis (like every 15 minutes, or every hour, or every 4 hours depending on how many transactions per second the database has).
Simply truncating the Transaction Log PERMENANTLY throws away all commited transactions. That means they are in the database, but they aren't in the backups or the .LDF file anywhere. If you have to restore the database, the restore chain ends at the last Full backup and goes absolutely no further. So, if you have to restore from backup, you are guaranteed to have lost 100% of what happened since that last Full Backup, which is often 24 hours worth of work in the database gone forever.
Most DBAs know that if they truncate the log they need to immediately do a Full backup just to restore the backup chain. But this is still against Microsoft's best practices. You only want to truncate the Transaction Log in the event of an emergency, and then you need to drop absolutely everything you are doing and do a Full Backup immediately.
Or, you can just do regular Transaction Log backups and not have the problem of the Transaction Log filling up in the first place.
For some companies, losing all those transactions may not be that big of a deal. A lot of companies say, "You know, it's not that important if we lose 24 hours of data. And we don't want the hassle of doing regular Transaction Log backups." These are often the companies that get into this problem because they aren't clearing the log because they aren't doing Transaction Log backups.
For those companies, Microsoft recommends using the Simple Recovery model. The Simple Recovery model does basically the same thing as constantly truncating the Transaction Log, but it is the "recommended" way to do this if you are willing to lose data. In the Simple Recovery model, the Transaction Log just throws away commited transactions. They are in the database (not in the backups). But if you take a Full (or Incremental) backup every hour, you are only going to lose something like an hour's worth of data, because you have the backups. It just guarantees that you are going to lose 100% of everything since the last good backup. Whereas, Full Recovery mode done properly allows you to restore right up to the second of the crash.
So, if you see Transaction Logs growing out of control, you probably need to ask whether the database needs Point in Time Recovery or whether it's ok to lose a certain number of hours (since the last backup) worth of data. If it's okay to lose data, the Simple Recovery model will keep the Transaction Log's internal size as small as it possibly can be (you have to have it for Transactions in process). If it's not okay to simply throw data away. If you need to make every attempt possible to save every record possible, then the correct solution is to set the database to Full Recovery mode and then do regular Transaction Log backups that remove enough of the transactions in the log to free it up of space regularly throughout the day. With either one of these setups, it should be extremly rare to see the Transaction Log fill up or need to truncate it.
This post has been edited by BBeck: 26 July 2012 - 01:17 PM
Re: Create Table from CSV
Posted 26 July 2012 - 03:00 PM
But Simple Recovery is there for a reason. I have some databases that are reporting databases. They are read only. We get a big data dump once a month. I also have databases where the managers have just said, "We don't care if we lose an entire 24 hours of data in that database". In those cases, there may not be much point in doing Point in Time Recovery. I set them to Simple Mode and do a Full Backup nightly. Management is ok if we lose that data since the last Full Backup, so so am I. Truthfully, it can all be reloaded from other systems.
But on a production system where you can't afford to lose any records, you want Full Recovery with regular Full, Incremental, and Transaction Log backups.
Re: Create Table from CSV
Posted 27 July 2012 - 08:00 AM
Incremental backups are what's on your non-SQL backup tapes. Differential backups are what SQL Server does. I can explain to you how a differential backup works "under the hood", but I can't remember it's name aparently. LOL
All of our backups have been setup and scheduled long ago. I haven't really had to mess with them much lately. :-)
Anyway, Incremental backups (non-SQL Server) and Differential backups are similar in concept. But it just sounds less professional to use the wrong word. :-)
Shame on me, I'll go slap my hand with a ruler now. :-)
This post has been edited by BBeck: 27 July 2012 - 08:02 AM