Term Lookup Transformation
The Term Lookup transformation matches terms extracted from text in a transformation input column with terms in a reference table. It then counts the number of times a term in the lookup table occurs in the input data set, and writes the count together with the term from the reference table to columns in the transformation output. This transformation is useful for creating a custom word list based on the input text, complete with word frequency statistics.
Before the Term Lookup transformation performs a lookup, it extracts words from the text in an input column using the same method as the Term Extraction transformation:
The Term Lookup performs a lookup and returns a value using the following rules:
- Text is broken into sentences.
- Sentences are broken into words.
- Words are normalized.
The Term Lookup performs a lookup and returns a value using the following rules:
- If the transformation is configured to perform case-sensitive matches, matches that fail a case-sensitive comparison are discarded. For example, student and STUDENT are treated as separate words.
Note:
A non-capitalized word can be matched with a word that is capitalized at the beginning of a sentence. For example, the match between student and Student succeeds when Student is the first word in a sentence.
When a lookup item contains terms that overlap in the reference set—that is, a sub-term is found in more than one reference record—the Term Lookup transformation returns only one lookup result. The following example shows the result when a lookup item contains an overlapping sub-term. The overlapping sub-term in this case is Windows, which is found within two reference terms. However, the transformation does not return two results, but returns only a single reference term, Microsoft Windows. The second reference term, Windows XP Home Edition SP1, is not returned.
Item | Value |
---|---|
Input term | Microsoft Windows XP Home Edition SP |
Reference terms | Microsoft Windows, Windows XP Home Edition SP1 |
Output | Microsoft Windows |
The Term Lookup transformation can only use a column that has either the DT_WSTR or the DT_NTEXT data type. If a column contains text, but does not have one of these data types, the Data Conversion transformation can add a column with the DT_WSTR or DT_NTEXT data type to the data flow and copy the column values to the new column. The output from the Data Conversion transformation can then be used as the input to the Term Lookup transformation.
The Term Lookup transformation input columns includes the InputColumnType property, which indicates the use of the column.InputColumnType can contain the following values:
- The value 0 indicates the column is passed through to the output only and is not used in the lookup.
- The value 1 indicates the column is used in the lookup only.
- The value 2 indicates the column is passed through to the output, and is also used in the lookup.
The Term Lookup transformation adds two columns to the transformation output, named by default Term and Frequency. Term contains a term from the lookup table and Frequency contains the number of times the term in the reference table occurs in the input data set. These columns do not include the CustomLineageID property.
The lookup table must be a table in a SQL Server 2000, a SQL Server 2005, or an Access database. If the output of the Term Extraction transformation is saved to a table, this table can be used as the reference table, but other tables can also be used. Text in flat files, Excel workbooks or other sources must be imported to a SQL Server database or an Access database before you can use the Term Lookup transformation.
The Term Lookup transformation uses a separate OLE DB connection to connect to the reference table.
The Term Lookup transformation works in a fully precached mode. At run time, the Term Lookup transformation reads the terms from the reference table and stores them in its private memory before it processes any transformation input rows.
Because the terms in an input column row may repeat, the output of the Term Lookup transformation typically has more rows than the transformation input.
The transformation has one input and one output. It does not support error outputs.
Print this post
No comments:
Post a Comment