Business Intelligence and Analytics
Data Warehouse
Pentaho

4. Inserting own code into transformations

  • Input files

  • 1.Transform words in uppercase, then calculate len_word and discard words shorter than 3 chars.
    [v1.ktr] [v2.ktr]
    Two versions:
    • - v1. By JavaScript
      Simple tasks with the Modified JavaScript Value step (JavaScript step).
      (JavaScript is a scripting language used in website development.)

      Javascript:
      var len_word = word.length;
      var u_word = upper(word);
      if (len_word > 3)
      trans_Status = CONTINUE_TRANSFORMATION;
      else
      trans_Status = SKIP_TRANSFORMATION;

      We use in the Fields tab the different values for Replace: Y/N based on the need to modify (e.g. word) or add a new value (e.g. len_word)

    • - v2. By Java
      Using the User Defined Java Class step allows Java code:
      public boolean processRow(StepMetaInterface smi, StepDataInterface sdi) throws KettleException {
      Object[] r = getRow();
      if (r == null) {
      setOutputDone();
      return false;
      }
      if (first) {
      first = false;
      }
      Object[] outputRow = createOutputRow(r, data.outputRowMeta.size());
      // YOUR CODE HERE
      putRow(data.outputRowMeta, outputRow);
      return true;
      }

      - The processRow() function at the beginning of the code is a predefined PDI function that processes a new row.
      - The getRow() function, another predefined PDI function, gets the next row from the input steps. It returns an Object array with the incoming row. A null value means that there are no more rows to process.

      YOUR CODE becomes:
      String word = get(Fields.In, "word").getString(r);
      get(Fields.Out, "word").setValue(outputRow, word.toUpperCase());
      long len_word = word.length();
      get(Fields.Out, "len_word").setValue(outputRow, len_word);
      if (len_word > 3)
      putRow(data.outputRowMeta, outputRow);


  • 2. Parsing unstructured files with JavaScript.
    Given as input a file with no tabular structure containing description of houses (file houses.txt), we need to obtain the property code useful to compare houses.
    [.ktr]

    Javascript:
    var prop_code;
    posCod = indexOf(text,'Property Code:');
    if (posCod>=0)
    prop_code = trim(substr(text,posCod+15));

    We take the substrings (15, length).


5. Trasforming the Dataset (When your dataset does not have the structure you like or need.)

  • Input files

  • 1. Sorting a dataset.
    Given as input "sales_data.csv", which contains product orders.
    We keep only a subset od the fields (Select values step).
    Then, we sort rows based on the PRODUCTLINE and SALES fields (Sort rows step) [.ktr]

  • 2. Aggregating data (Group by step).
    Given as input "sales_data.csv", which contains product orders.
    We calculate:
    - the total sales amount, the average sales, and the number of distinct dates in the dataset.
    - for each PRODUCTLINE, the total sales amount, the average sales, the number of distinct dates, and the number of distinct products.
    - for each PRODUCT, the total sales amount, the average sales, the number of distinct dates.
    [.ktr]

  • 3. Normalizing data: create columns from rows.
    Given as input "movies.txt", which contains descriptions of movies.
    We create for each movie a row of database. So, we need group by film, and calculate columns from the description.
    [.ktr]

  • 4. Normalizing data: create rows from columns.
    Given as input "cars.txt", which contains amounts and quantity of different vehicles as columns.
    We create a row for each type of vehicle.
    [.ktr]


CONTACT

If you have any further question or information request about these topics, please, contact me:

send an email to: linda.oro@icar.cnr.it