Summary: Troubled by enormous data extraction when using CSV/TSV/TAB? Our Software Developer at Suyati Technologies has come up with a unique answer to prevent data loss while reducing the time taken for extracting the data. According to the team – “We find that the time taken for extraction using our string handling approach on a file containing more than 5000 records is 4 times faster than the Excel Interop approach. The same code logic works for CSV/TSV/TAB extraction and has been tested with more than 300 files.” Read on to find out how they did it!
In most of the B2B systems, huge amount of data is transferred, usually in plain text format. For example, CSV (Comma separated value), TSV (Tab Separated value) or TAB (Tab Separated) file formats are used particularly in data generation by e-commerce partners, and is sent to multiple business end points for analytics and reporting.
The most popular way of handling these file formats would be to use Excel Interop. The problem that we faced in using Excel Interop is that it takes huge amount of time to process a file with more than 5000 records and ultimately gets timed out on the service or app that houses this extraction logic.
Another way to tackle extraction of data from CSV/TSV/TAB is to use OleDBAdapter with specific delimiters in the connection string. It works, but data will not be in the proper format as expected.
The need is to have a technique in place that can prevent data loss and work with better performance. Working on the assumption that complex problems could always have a simple solution, we came up with a solution from scratch that only uses string handling in .NET and pure mathematics to achieve this task of data extraction into a Data Table format.
Code Extraction Logic
The plain text data is read using StreamReader and data starting from the HeaderLength is taken as the finalData for the extraction logic. Since we know the kind of delimiter for a CSV/TSV/TAB file, we use this delimiter to strip the data into individual data. Column headers are calculated from data at the 0th index in the finalData string array till the line break, since data starts from the line break of the last column.
From this point, data is read row by row using the line break concept. We are also handling scenarios where data consists of a comma in a CSV file where delimiter is also a comma.
Using all this column information and row information, we construct a DataTable with the data extracted from CSV/TSV/TAB.
If the code needs to work for CSV Extraction, then we need to call our extraction method as follows.
// For CSV Extraction
- Extraction Extractionoperations = new Extraction();
- fileInfo = new FileInfo(@”D:TestTest.csv”);
- HeaderLength = 0;
- operations.EndLineDelimiter = Constants.CSVEndLine;
- operations.FileDelimiter = Constants.CSVDelimiter;
- DataTable CSVData = operations.ExtractDataFromDelimitedFiles(fileInfo, HeaderLength);
If the code needs to work for TSV/TAB Extraction, then we need to call our extraction method as follows.
// For TSV Extraction
- Extraction operations = new Extraction();
- FileInfo fileInfo = new FileInfo(@”D:DevTestTest.tsv”);
- int HeaderLength = 0;
- operations.EndLineDelimiter = Constants.TSVEndLine;
- operations.FileDelimiter = Constants.TSVDelimiter;
- DataTable TSVData = operations.ExtractDataFromDelimitedFiles(fileInfo, HeaderLength);
Here, one most important thing that we need to note is the HeaderLength variable which needs to be set with the row number from where the data starts in the excel sheet. Row number can be noted from excel as shown below (HeaderLength=5).
From our performance analysis, we find that the time taken for extraction using our string handling approach on a file containing more than 5000 records is 4 times faster than the Excel Interop approach. The same code logic works for CSV/TSV/TAB extraction and has been tested with more than 300 files.
The visual studio projects types are of 2008.
4 comments
I tried a few tsv files with the sample application and it works with lightening speed. The code looks pretty complex but hats off in deriving a code with just string handling techniques. I think this is definitely a great parser.
I had a code which uses(using JET instead of ODBC) and have run into a serious drawback. I have a field that is numbers, but I use a CSV format instead of Excel because I NEED to keep the leading zeros. When I use JET, it reads that field as a double and drops the leading zeros.
This is a great relief. I am happy to see my leading zeros still intact after reading my CSV. 🙂
I have a .CSV file from which I read the data and put it in a dataset, the dataset then undergoes various actions and I end up putting the data into a sql server2005 database. The following is the code that I am using to access to .CSV file and extract the data:
cnnCSVConnection = New OleDb.OleDbConnection(“Provider=Microsoft.Jet.OLEDB.4.0;Data Source=” + path + “;Extended Properties=Text;”)
Dim cmdCSVCommand As OleDb.OleDbCommand = New OleDb.OleDbCommand(“SELECT * FROM ” & fileName, cnnCSVConnection)
Dim adpCSV As OleDb.OleDbDataAdapter = New OleDb.OleDbDataAdapter(cmdCSVCommand)
cnnCSVConnection.Open()
adpCSV.Fill(dstCSVFile, “MemberData”)
The problem is that the select command doesn’t always return all the data in the .CSV file. The current problem that I am having is that the first column is being returned as blank. I have attached the .csv file for people to have a look at (Note the file had to be uploaded as a .txt file). The problem only seems to occur when there is a limited amount of data in the .csv file. If I submit a file with 5 or 133 records no problems occur.
What am I doing wrong? Would be glad if I can get the VB.NET version for csv data extraction.
Thanks
Wendy
Thank you all for your responses.
Regarding Wendy’s question, it often happens with OleDB that the data might be garbled during extraction or sometimes the connection might refuse to open if the file has some problem. Therefore, a generic parser using pure string handling technique would be apt for extraction of data from plain text format files.
Regaring the VB.NET version of this code, We suggest you to try online C#.NET to VB.NET code converters.
Do let us know your comments. Thank you guys 🙂