The Fuzzy Grouping transformation performs data cleaning tasks by identifying rows of data that are likely to be duplicates and selecting a canonical row of data to use in standardizing the data.
The transformation output includes all input columns, one or more columns with standardized data, and a column that contains the similarity score. The score is a decimal value between 0 and 1. The canonical row has a score of 1. Other rows in the fuzzy group have scores that indicate how well the row matches the canonical row. The closer the score is to 1, the more closely the row matches the canonical row. If the fuzzy group includes rows that are exact duplicates of the canonical row, these rows also have a score of 1. The transformation does not remove duplicate rows; it groups them by creating a key that relates the canonical row to similar rows.
The transformation produces one output row for each input row, with the following additional columns:
_key_in, a column that uniquely identifies each row.
_key_out, a column that identifies a group of duplicate rows. The _key_out column has the value of the _key_in column in the canonical data row. Rows with the same value in _key_out are part of the same group. The _key_outvalue for a group corresponds to the value of _key_in in the canonical data row.
_score, a value between 0 and 1 that indicates the similarity of the input row to the canonical row.
These are the default column names and you can configure the Fuzzy Grouping transformation to use other names. The output also provides a similarity score for each column that participates in a fuzzy grouping.
The Fuzzy Grouping transformation includes two features for customizing the grouping it performs: token delimiters and similarity threshold. The transformation provides a default set of delimiters used to tokenize the data, but you can add new delimiters that improve the tokenization of your data.
The similarity threshold indicates how strictly the transformation identifies duplicates. The similarity thresholds can be set at the component and the column levels. The column-level similarity threshold is only available to columns that perform a fuzzy match. The similarity range is 0 to 1. The closer to 1 the threshold is, the more similar the rows and columns must be to qualify as duplicates. You specify the similarity threshold among rows and columns by setting the MinSimilarity property at the component and column levels. To satisfy the similarity that is specified at the component level, all rows must have a similarity across all columns that is greater than or equal to the similarity threshold that is specified at the component level.
The Fuzzy Grouping transformation calculates internal measures of similarity, and rows that are less similar than the value specified in MinSimilarity are not grouped.
To identify a similarity threshold that works for your data, you may have to apply the Fuzzy Grouping transformation several times using different minimum similarity thresholds. At run time, the score columns in transformation output contain the similarity scores for each row in a group. You can use these values to identify the similarity threshold that is appropriate for your data. If you want to increase similarity, you should set MinSimilarity to a value larger than the value in the score columns.
You can customize the grouping that the transformation performs by setting the properties of the columns in the Fuzzy Grouping transformation input. For example, the FuzzyComparisonFlags property specifies how the transformation compares the string data in a column, and the ExactFuzzy property specifies whether the transformation performs a fuzzy match or an exact match.
What does the Fuzzy Grouping Transformation do?
The Fuzzy Grouping Transformation allows you to identify similar items within a dataset. It doesn’t – as its name perhaps suggests – actually perform any group by operations. However it does provide you with the information you need to group rows by, as what it does is help identify similar rows within a dataset. It uses the same algorithms as the Fuzzy Lookup transformation, and operates a bit like a Fuzzy Lookup except that it effectively creates its own runtime lookup table, based on the incoming data, which it then references to identify similar rows.
Print this post
No comments:
Post a Comment