So very recently in my working life I was offered a chance to build a database using SQL Server. Historically I've always been an Oracle user, so I thought this would be a great opportunity to get familiar with Microsoft's enterprise database product and also capture some of the challenges that I came up against, right from the get-go!
It wasn't long before I hit some difficulties. First was the learning curve with security and its many layers, but I'll post on that down the track. Today I want to cover data loading. Every database technology has its own set of native import/export tools, and generally plenty of third-party tools too. Generally speaking, there are two types of data ingestion. The first is executing an INSERT statement, where a statement must be built for every row of the source data to be imported. This is fine for, let's say, ten thousand records, but when you want to import millions of records you could be waiting hours or even days! So most database technologies also have some sort of bulk insert method, which inserts many rows in a single transaction; for SQL Server this is achieved using a command-line utility called 'bcp.exe'.
https://technet.microsoft.com/en-us/library/aa337544.aspx
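To give a feel for it, a basic character-mode import looks something like the command below (the server, database, table and file names here are just placeholders for illustration):

bcp MyDatabase.dbo.MyTable in C:\data\mytable.txt -c -t, -r\n -S MYSERVER -T

Here 'in' means import, -c treats everything as character data, -t and -r set the field and row terminators, -S names the server and -T uses Windows authentication.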
Of course there is usually a point-and-click wizard to import data, which is great for ad-hoc imports, but when you need to automate the process a command-line tool is very handy. However there are always a few drawbacks. The first is the obvious one: command-line tools are always a pain in the backside to get working; one missing flag or a file path typo and you get some cryptic error code. And the help function, ironically, with its lack of context and its common omission of good examples, usually isn't much help at all! So trying to import data (which is pretty annoying at the best of times) with a command-line tool is probably one of the most frustrating things one can attempt.
Anyway, here is a list of tips for the things I found hard to do:
- The first problem I had was that, because bcp is a Windows product, it didn't play well with Unix text files. Why? Well, Windows typically marks the end of a text line with a carriage return plus a line feed {CR}{LF}, and when you give bcp the usual \n row terminator (-r\n) for an import it actually goes looking for that {CR}{LF} pair. Hence, on a Unix file, -r\n did not work and just threw an error.
Then, when I tried specifying the row terminator another way, I got no error at all, just zero rows copied!
After much hunting I solved the problem by using 0x0A (-r0x0A), which is the hex value for a line feed (there's an example command further down). Just a word of warning: this will also work for Windows files, because they have both the {CR} and the {LF}, however the {CR} then gets imported into the last column of your table.
For some further reading on this see Coding Horror’s post:
https://blog.codinghorror.com/the-great-newline-schism/
Also, if you are Windows-based, a really helpful tool for inspecting files is Notepad++.
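To make that concrete, here is the same made-up import command again, this time using the hex line-feed row terminator:

bcp MyDatabase.dbo.MyTable in C:\data\unix_file.txt -c -t, -r0x0A -S MYSERVER -T

If the source turns out to be a Windows {CR}{LF} file, remember the warning above and strip the carriage returns out first (or clean them up after loading) so they don't end up in your last column.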
- The format file. In short, if you want to remap columns or treat individual columns differently, you have to use the format file. According to:
https://msdn.microsoft.com/en-us/library/ms191516.aspx
bcp can create a default format file based on a table in your database, which you are then meant to edit so that it aligns with the data file you want to load. Really? I just want to skip one column on import… Well, that is what you have to do. I was able to skip the column, but it was a lot of work. There are actually two different format file types you can use. The first is the old type, a cryptic header-less tab-delimited file, which is what I used. The second is the newer XML format, which seemed easier to interpret but still looked like a lot of work to modify (I didn't actually try it).
It would have required far more effort on Microsoft's part, but it would have been nice if bcp could have created the format file based on the destination table AND the source file.
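For reference, bcp will at least generate that starting-point format file for you; the command below is just an illustration with made-up names (add -x to get the XML flavour rather than the old tab-delimited one):

bcp MyDatabase.dbo.MyTable format nul -c -f MyTable.fmt -S MYSERVER -T

From there, if the documentation is to be believed, skipping a field is just a matter of editing the generated file, e.g. in the old non-XML format you set the server column order of the unwanted field to 0.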
- This was the worst thing that I found, and for an enterprise product I almost couldn't believe it was true: there is no option available to set a text column qualifier! What, that can't be right? Afraid so, folks… (feel free to correct me if I'm wrong). So for the common scenario of having a CSV file which, for example, contains a list of addresses enclosed in double quotation marks, you are in trouble. The documentation says to use format files. I thought about it, but no, I couldn't bring myself to go down that path. Instead I figured an easier solution (which it was) was to write a small utility that strips out all the quotation marks and swaps the field-delimiter commas for pipe delimiters (there's a rough sketch of the idea below the link). I needed it done quickly, so I wrote it in VBA in about an hour. On a modern solid state drive it gave me reasonable performance for files of approximately 1GB.
You can find a copy of the source here:
https://github.com/Leighroyus/CSVdoubleQuoteRemover
Just a note: unintentionally, this code will also convert Unix-style {LF} text files into Windows-style {CR}{LF} text files.
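For anyone who just wants the gist without pulling down the repo, here is a minimal sketch of the same idea in VBA. It is not the actual utility from the repository; the file paths are made up, there is no error handling, and it simply shows the quote-stripping and delimiter-swapping approach (and, like the real thing, Print # writes {CR}{LF} line endings):

Sub CleanCsvForBcp()
    Dim inFile As Integer, outFile As Integer
    Dim rawLine As String, cleanLine As String
    Dim i As Long, ch As String
    Dim inQuotes As Boolean

    inFile = FreeFile
    Open "C:\data\addresses.csv" For Input As #inFile          ' made-up source path
    outFile = FreeFile
    Open "C:\data\addresses_clean.txt" For Output As #outFile  ' made-up output path

    Do While Not EOF(inFile)
        Line Input #inFile, rawLine
        cleanLine = ""
        inQuotes = False
        For i = 1 To Len(rawLine)
            ch = Mid$(rawLine, i, 1)
            If ch = """" Then
                inQuotes = Not inQuotes          ' drop the text qualifier itself
            ElseIf ch = "," And Not inQuotes Then
                cleanLine = cleanLine & "|"      ' swap delimiter commas for pipes
            Else
                cleanLine = cleanLine & ch       ' keep everything else, including commas inside quotes
            End If
        Next i
        Print #outFile, cleanLine                ' Print # terminates each line with {CR}{LF}
    Loop

    Close #inFile
    Close #outFile
End Sub

The cleaned file can then be loaded with something like bcp MyDatabase.dbo.Addresses in C:\data\addresses_clean.txt -c -t"|" -r\n -S MYSERVER -T (names made up again), since the file now has pipe delimiters and Windows line endings.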