Friday, August 13, 2010

Percentage Sampling Transformation

What does the Percentage Sampling Transformation do?
This component is very simple: it splits a dataset by randomly directing each row to one of two possible outputs. All you need to decide is the proportion (as a whole percentage) in which you want the rows split between the two output data flows.

The effect of the Random Seed can be seen in the sample package. If you run it multiple times you will get a different split each time, because when no seed is set the package derives one from the tick count of the operating system (and no, I don’t know what that is either!), so the seed differs on every run.

Note that in the example, even though the percentage sample is set to 30%, it’s unusual for the output rows to be split exactly 30:70. This is because each row is allocated to an output by a throw of the randomiser’s dice. If you set a value for the Random Seed you fix the results of the throws and will always get the same rows sent to the same outputs, though there is still no guarantee the split will be exactly 30:70. As the dataset you split gets bigger, the deviation from 30:70 becomes less significant relative to its size.
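To make the dice-throwing behaviour concrete, here is a minimal Python sketch of the idea. It is not the component's actual implementation: the function name `percentage_sample` and its parameters are made up for illustration, and it simply assumes each row gets an independent random draw against the chosen percentage, as described above.

```python
import random


def percentage_sample(rows, percent, seed=None):
    """Hypothetical helper: split rows into (selected, unselected).

    Each row gets its own random draw, so the split is only
    approximately `percent` : (100 - percent).
    """
    # With seed=None Python picks a fresh seed on every run, which
    # mimics leaving the Random Seed unset. A fixed seed makes the
    # sequence of draws repeatable.
    rng = random.Random(seed)
    selected, unselected = [], []
    for row in rows:
        if rng.random() * 100 < percent:
            selected.append(row)
        else:
            unselected.append(row)
    return selected, unselected


rows = list(range(1000))

# Same seed twice: the same rows go to the same outputs.
a1, b1 = percentage_sample(rows, 30, seed=42)
a2, b2 = percentage_sample(rows, 30, seed=42)
print(a1 == a2)  # True

# The counts are close to 300:700 but not guaranteed to match exactly.
print(len(a1), len(b1))
```

Running the sketch without a seed a few times shows the counts drifting around 300:700, and raising the row count shows that drift shrinking as a share of the total.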

Where would you use this transformation?
The main use for this, as far as Microsoft is concerned, is carving up datasets for Data Mining into training and test cases. But it is useful anywhere you need to divide a dataset at random.














