SSIS – split big file into manageable pieces

We have a process that is taking a long time because the file we receive is pretty big.  What I’m doing is as follows:

  • Create a stripped down file with just this year’s data in it
  • Split that file into 5 much smaller pieces
  • Move everything from the directory except the 5 files

We have another process that was taking a while to ingest and do what was needed with the big file, so I set it up so that all they have to do is process the text files in this one directory.

 

 

 

 

 

 

 

 

 

Pieces:

Created flat file connection managers to the Big file, the Yearly file and the 5 split files.  I also created parameters for the Archive Directory I want to move unneeded files to and for the file name that will allow me to loop through all of the text files in the directory.

Data flow to Create the 2018 file

The Big File and the 2018 Data file are just flat file connectors to the large file and the yearly data file I’m splitting.

The Conditional split is using an expression to just keep the data with the right four characters equal to the year:
(RIGHT([Column 0],4) == “2018”)

Data flow to Create the 5 smaller files

 

This one was a lot easier than I thought, but there are some gotchas.  I was using SSDT 2015, and SQL 2012 didn’t like the Balanced Data Distributor.  I needed something quick, so I recreated it in BIDS and it works like a charm.  Something about the component not being registered, etc.

I used a Balanced Data Distributor to drop the rows into each of the files.  It’s not exactly even, two files are getting 4m, three are getting 3.7M, but I’m not picky.  I’d never used this component before but it ripped a 19M file into 5 separate files in less than 3 seconds, so I wasn’t going to complain.  It’s not configurable, but truthfully, I didn’t care.  Fast and clean is good.

Cleanup

For the last part, I set up a Foreach Loop Container to loop through my directory looking for *.txt and mapping the file name to the user variable I set up for it.

There’s a sequence container inside the foreach loop that has a constraint and expression to the file task.  I just connected them, adjusted the connection to look for SplitFile in the file name and if it found it, ignore it, otherwise, go to the file task.

The file task just moves the file to the archive directory based on the filename passed to it.

Voila.  Please let me know if there’s something I left out or leave a comment with a question.  All in all, a lot easier than I thought it’d be.