Summary: Troubled by extracting enormous volumes of data from CSV/TSV/TAB files? Our software developer at Suyati Technologies has come up with a neat answer that prevents data loss while reducing extraction time. According to the team: "We find that the time taken for extraction using our string handling approach on a file containing more than 5,000 records is 4 times faster than the Excel Interop approach. The same code logic works for CSV/TSV/TAB extraction and has been tested with more than 300 files." Read on to find out how they did it!
In most B2B systems, huge amounts of data are transferred, usually in plain-text format. For example, CSV (comma-separated values), TSV (tab-separated values), and TAB (tab-separated) files are commonly generated by e-commerce partners and sent to multiple business endpoints for analytics and reporting.
The most popular way of handling these file formats is to use Excel Interop. The problem we faced with Excel Interop is that it takes a huge amount of time to process a file with more than 5,000 records, and the service or app that houses the extraction logic ultimately times out.
Another way to extract data from CSV/TSV/TAB files is to use an OleDbDataAdapter with the appropriate delimiter specified in the connection string. It works, but the data does not come out in the expected format.
What is needed is a technique that prevents data loss and performs better. Working on the assumption that complex problems can have simple solutions, we built a solution from scratch that uses only string handling in .NET and plain arithmetic to extract the data into a DataTable.
The plain-text data is read using a StreamReader, and everything from the HeaderLength onwards is taken as the finalData for the extraction logic. Since the delimiter of a CSV/TSV/TAB file is known, we use it to split the data into individual values. The column headers are taken from the 0th index of the finalData string array up to the first line break, since the row data begins after the line break that follows the last column header.
From this point, the data is read row by row using line breaks. We also handle the scenario where a field itself contains a comma in a CSV file whose delimiter is the comma.
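The post does not show the splitting routine itself; a minimal sketch of how a delimiter split can ignore delimiters inside double-quoted fields (the class and method names below are illustrative, not from the actual codebase):

```csharp
using System.Collections.Generic;
using System.Text;

public static class CsvSplit
{
    // Splits one row on the delimiter, but ignores delimiters that
    // appear inside double-quoted fields (e.g. "Smith, John").
    public static List<string> SplitRow(string row, char delimiter)
    {
        var fields = new List<string>();
        var current = new StringBuilder();
        bool inQuotes = false;

        foreach (char c in row)
        {
            if (c == '"')
            {
                inQuotes = !inQuotes;            // toggle quoted state
            }
            else if (c == delimiter && !inQuotes)
            {
                fields.Add(current.ToString());  // field boundary
                current.Clear();
            }
            else
            {
                current.Append(c);
            }
        }
        fields.Add(current.ToString());          // last field
        return fields;
    }
}
```

For example, splitting `a,"b,c",d` on a comma yields three fields, with `b,c` kept intact as one value.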
Using this column and row information, we construct a DataTable holding the data extracted from the CSV/TSV/TAB file.
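The overall flow — skip any preamble rows, read the header line, then fill the table line by line — can be sketched as a simplified, standalone version. The names below are our own for illustration, not the actual ExtractDataFromDelimitedFiles implementation, and this sketch assumes unquoted fields:

```csharp
using System.Data;
using System.IO;

public static class DelimitedExtractor
{
    // Reads a delimited text file into a DataTable: skip headerLength
    // preamble rows, treat the next line as column headers, then add
    // one DataRow per remaining line.
    public static DataTable Extract(string path, char delimiter, int headerLength)
    {
        var table = new DataTable();
        using (var reader = new StreamReader(path))
        {
            for (int i = 0; i < headerLength; i++)
                reader.ReadLine();                     // skip preamble rows

            string headerLine = reader.ReadLine();
            if (headerLine == null) return table;      // empty file

            foreach (string column in headerLine.Split(delimiter))
                table.Columns.Add(column.Trim());

            string line;
            while ((line = reader.ReadLine()) != null)
            {
                if (line.Length == 0) continue;        // skip blank lines
                table.Rows.Add(line.Split(delimiter)); // one DataRow per line
            }
        }
        return table;
    }
}
```

A production version would additionally handle delimiters inside quoted fields and rows with a different number of values than the header.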
If the code needs to perform CSV extraction, the extraction method is called as follows.
// For CSV Extraction
Extraction operations = new Extraction();
FileInfo fileInfo = new FileInfo(@"D:\Test\Test.csv");
int HeaderLength = 0;
operations.EndLineDelimiter = Constants.CSVEndLine;
operations.FileDelimiter = Constants.CSVDelimiter;
DataTable CSVData = operations.ExtractDataFromDelimitedFiles(fileInfo, HeaderLength);
If the code needs to perform TSV/TAB extraction, the extraction method is called as follows.
// For TSV Extraction
Extraction operations = new Extraction();
FileInfo fileInfo = new FileInfo(@"D:\Dev\Test\Test.tsv");
int HeaderLength = 0;
operations.EndLineDelimiter = Constants.TSVEndLine;
operations.FileDelimiter = Constants.TSVDelimiter;
DataTable TSVData = operations.ExtractDataFromDelimitedFiles(fileInfo, HeaderLength);
The most important thing to note here is the HeaderLength variable, which must be set to the row number at which the data starts in the sheet. For example, if the data begins at row 5 when the file is opened in Excel, set HeaderLength = 5.
From our performance analysis, we found that extraction using our string-handling approach on a file containing more than 5,000 records is 4 times faster than the Excel Interop approach. The same code logic works for CSV/TSV/TAB extraction and has been tested with more than 300 files.
The Visual Studio projects target Visual Studio 2008.