Business Intelligence and Analytics
Data Warehouse

0. Installing Pentaho Data Integration (PDI) and Working with Spoon

1. Introducing Transformations

  • Input files
  • 1. Simple transformation: say hello to a list of persons. (.PDF) [.ktr]
  • 2. Transformation that reads a list of projects from a CSV file and then calculates the time it took to complete each project. [.ktr]
  • 3. Transformation that generates a dataset with all dates in between a given range of dates. (.PDF) [.ktr]
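The third transformation builds its dataset with PDI steps; outside PDI, the same logic can be sketched in Python (the function name and signature are illustrative, not part of the .ktr):

```python
from datetime import date, timedelta

def dates_between(start: date, end: date) -> list[date]:
    """Return every date from start to end, inclusive."""
    days = (end - start).days
    return [start + timedelta(days=i) for i in range(days + 1)]
```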

2. Filtering, Cleansing, Validating, and Fixing Data

  • Input files

  • 1. Reading a file and extracting and filtering a list of words:
    - v1. Remove irrelevant words, for example punctuation, by using a regular expression. Filter rows based on conditions and comparisons, for instance on word_length.
    - v2. Remove irrelevant words, for example punctuation, by using the Java Filter step.
    - v3. Filtering by using a dictionary: we want to eliminate words that appear in a dictionary, for instance stop words.
    (v1_v2.PDF) [v1.ktr] [v2.ktr] (v3.PDF) [v3.ktr]
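In PDI these steps are configured graphically; the combined logic of v1 and v3 can be sketched in Python as follows (the stop-word set and the minimum length are illustrative, and in the transformation the dictionary comes from an input file):

```python
import re

# Hypothetical stop-word dictionary; in the transformation this is a
# separate input, not a hard-coded set.
STOP_WORDS = {"the", "a", "an", "and", "of", "to"}

def tokenize(text: str) -> list[str]:
    # v1: a regular expression keeps only alphabetic words,
    # dropping punctuation and other irrelevant tokens.
    return re.findall(r"[A-Za-z]+", text.lower())

def filter_words(words: list[str], min_length: int = 3) -> list[str]:
    # v1: filter rows on a condition, e.g. word_length >= min_length.
    # v3: eliminate words found in the dictionary (stop words).
    return [w for w in words if len(w) >= min_length and w not in STOP_WORDS]
```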

  • 2. We have data from multiple sources:
    - A stream with people's information along with their nationalities
    - A secondary stream with information about the languages (countries.xml)
    We want to merge the different pieces of information to find which language each person speaks.

    Find which language people speak
    - v1: by using Lookup,
    - v2: by using Lookup and standardizing information. The inputs contain different values for the same country, so before mapping we need to standardize the country names; in particular we have "Russia" vs "Russian Federation".
    (v1.PDF) [v1.ktr] (v2.PDF) [v2.ktr]

  • 3. Match countries by fuzzy matching
    - v1. We deal with non-exact matches.
    There exist several algorithms based on a metric distance (i.e., the comparison is based on how the terms are spelled). An example is the Levenshtein algorithm: it calculates the distance between two strings as the number of edit steps (character insertions, deletions, or replacements) needed to get from one string to the other.
    You can also try the Damerau-Levenshtein algorithm, which is similar to Levenshtein but adds the transposition operation.
    - v2. In this version we remove duplicate rows.
    (v1_v2.PDF) [v1.ktr] [v2.ktr]
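PDI's Fuzzy match step computes this distance internally; a minimal Python implementation of the Levenshtein metric described above, for experimenting outside PDI:

```python
def levenshtein(a: str, b: str) -> int:
    """Edit distance: the minimum number of insertions, deletions,
    or replacements needed to turn string a into string b."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + 1,                 # deletion
                            curr[j - 1] + 1,             # insertion
                            prev[j - 1] + (ca != cb)))   # replacement
        prev = curr
    return prev[-1]
```

For example, "Russia" and "Russian" are at distance 1 (one insertion), which is why a small distance threshold lets the fuzzy match pair them.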

  • 4. Validating product codes. The file sales_data.csv contains products. The product code field is made up of two parts: a prefix and a number. We expect the prefix to be one of the following: S10, S12, S18, S24, S32, S50, S70, or S72.
    - v1. Implement a validation: if the prefix doesn't belong to the expected list, the row is reported as an error and discarded. So, the first version just reports invalid product codes as errors to the log.
    - v2. In this version we mark invalid product codes as "invalid".
    (v1.PDF) [v1.ktr] (v2.PDF) [v2.ktr]
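Both versions share the same prefix check; a Python sketch of the two behaviours (it assumes the prefix is the first three characters of the code, e.g. "S10_1678" — the actual split is configured in the transformation):

```python
VALID_PREFIXES = {"S10", "S12", "S18", "S24", "S32", "S50", "S70", "S72"}

def validate_code(code: str) -> str:
    # v2 behaviour: return the code unchanged when its prefix is
    # expected, otherwise mark it as "invalid".
    prefix = code[:3]  # assumption: prefix = first three characters
    return code if prefix in VALID_PREFIXES else "invalid"

def validate_rows(codes: list[str]) -> tuple[list[str], list[str]]:
    # v1 behaviour: keep valid rows, divert invalid ones to an error
    # list (in PDI, the error rows go to the step's error handling).
    valid, errors = [], []
    for code in codes:
        (valid if code[:3] in VALID_PREFIXES else errors).append(code)
    return valid, errors
```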

3. Introducing Jobs
A Job must start with a START job entry.
Types of hops:
- gray: unconditional execution; the destination entry executes no matter what the result of the previous entry is.
- green: successful execution; the destination entry executes only if the previous job entry succeeded.
- red: failed execution; the destination entry executes only if the previous job entry failed.

  • 1. Simple Job that creates a folder.

    Folder name option, type

    ${Internal.Entry.Current.Directory} is an internal PDI variable that at run time resolves to the folder where the Job is saved.

  • 2. Usually, you don't run a PDI Transformation in isolation; you embed it in a bigger process. (For example: download a file, clean it, load the information from the file into a database.)
    These kinds of processes can be implemented as PDI jobs.
    One of the tasks of such jobs is executing one or more transformations. This is done with a special job entry: the Transformation entry.


    Generate Rows --> Calculator --> Clone Row --> Calculator
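The step chain above can be imitated outside PDI; this sketch uses illustrative field names and calculations, since the actual ones are defined inside the transformation:

```python
def generate_rows(n: int) -> list[dict]:
    # Generate Rows: emit n identical rows with a constant field.
    return [{"value": 1} for _ in range(n)]

def calculator_double(rows: list[dict]) -> list[dict]:
    # Calculator: add a derived field to each row.
    for row in rows:
        row["doubled"] = row["value"] * 2
    return rows

def clone_row(rows: list[dict], copies: int = 2) -> list[dict]:
    # Clone Row: each incoming row produces `copies` output rows.
    return [dict(row) for row in rows for _ in range(copies)]
```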

  • 3. Managing files: Creating a Transformation that moves some files (... use it in a job)



If you have any further questions or requests for information about these topics, please contact me:

send an email to: