Rule types¶

This section describes each rule in terms of the following.

Name of rule

Required arguments

Optional arguments

Description

Notes

The types of rules supported in data preparation are as follows:

drop

header

settype

setformat

rename

keep

delete

replace

set

derive

split

merge

extract

countpattern

nest

unnest

flatten

aggregate

pivot

unpivot

join

union

window

In addition to these rules, data preparation provides various expressions, thereby supporting almost every function required for general data preprocessing.

drop¶

Required arguments

Column: A list of target columns

Description

Deletes the selected columns.

settype¶

Required arguments

Column: A list of target columns

New type: Select one out of Long, Double, String, Boolean, and Timestamp

Optional arguments

Set format: A format string (Joda Time) in the case of timestamp

Description

This rule changes the type of the selected columns.

The rule is considered successful even if the result is a type mismatch, which should be separately addressed.

setformat¶

Required arguments

Column: A list of target columns

Set format: A Joda-Time format string

Description

This rule changes the display format of a Timestamp column.

The target column must be of the Timestamp type.

Notes

As shown below, the format input field lists different entries depending on the input. The candidate list is narrowed as more values are entered.

rename¶

Required arguments

Column: A single target column

New column name: New name

Description

This rule changes the name of the selected column.

To rename two or more columns at once, click the Rename multiple columns button at the bottom of the command input panel to display the following popup.

keep¶

Required arguments

Condition: A conditional expression returning a Boolean value

Description

All rows are deleted except the rows that return true for the conditional expression.

delete¶

Required arguments

Condition: A conditional expression returning a Boolean value

Description

All rows that return true for the conditional expression are deleted. This is the opposite of keep.

replace¶

Required arguments

Column: A list of target columns

Pattern: A string pattern to be replaced

In the case of a constant string: Characters enclosed inside ' ('Houston', 'Naperville', 'Philadelphia' etc.)

In the case of a regular expression: Characters enclosed inside / (/[ ,_]+/, /\s+$/, etc.)

New value: A new string expression to replace the specified pattern

Constant string

Regular expression $1_$2_$3, etc.

Optional arguments

Ignore between characters: Does not make any replacement for content between the characters entered here

Match all occurrences: Whether all characters of a word must match

Ignore case: Whether to make the strings case-insensitive

Description

String replacement is performed for the selected columns.

Notes

Do not use ' or / in a new value.

Values from other columns are not available as new values. replace performs string replacement for content in the selected columns only. (cf. set rule)

set¶

Required arguments

Column: A list of target columns

Expression: An expression to be applied to the values of the target column. Values from other columns may be referenced. (cf. replace rule)

When multiple columns are involved, use a $col variable, which will be substituted by the respective target column during each conversion.

That is, when applying the set command on column1 and column2, $col becomes column1 during conversion of column1, and $col becomes column2 during conversion of column2.

Optional arguments

Use only under the following conditions

The set rule is applied only to rows satisfying this condition.

This rule may be regarded the same as the WHERE statement in SQL.

Description

This rule replaces the values in the selected column with results returned by the expression.

When using a complex expression, click the Advanced editor to display the popup shown below:

In the Advanced editor, you can edit the expression in a larger window while viewing the column list and a list of functions and their descriptions, and also run a validity check before implementing the expression.

derive¶

Required arguments

Expression: An expression whose resulting values are to form a new column. Similar to the set rule, values from other columns may be referenced.

New column name

Description

While similar to the set rule, this rule creates a new column instead of replacing an existing one.

Notes

The new column is inserted after the last existing column in the expression.

split¶

Required arguments

Column: A list of target columns

Pattern: A string expression that serves as a separator that splits the target strings. Allows a regular expression as is the case for the replace rule.

Number: Number of columns to be divided into.

Description

Each row is split by the given Number - 1.

When the pattern is no longer matched, the rest columns contain a null.

Notes

Note that columns are created as many as the Number input.

merge¶

Required arguments

Column: A list of target columns

Delimiter: A constant string with which values of different columns are concatenated.

New column name

Description

The target columns are merged with the Delimiter into a new column.

Notes

Similar to the replace rule, enclosing with a ' may be skipped. That is, strings not enclosed by / or ' are automatically enclosed by '.

extract¶

Required arguments

Column: A list of target columns

Pattern: A string pattern to be extracted. Allows a regular expression as is the case for the replace rule.

Number: Number of instances to be extracted

Optional arguments

Ignore between characters: Does not make any replacement for content between the characters entered here

Ignore case: Whether to make the strings case-insensitive

Description

A new column(s) with content matching the given pattern is created.

Notes

When there are multiple target columns, the resulting columns are inserted after each target column.

countpattern¶

Required arguments

Column: A list of target columns

Pattern: A string pattern to be detected. Allows a regular expression as is the case for the replace rule.

Optional arguments

Ignore between characters: Does not make any replacement for content between the characters entered here

Ignore case: Whether to make the strings case-insensitive

Description

New columns are created based on the number of matches with the pattern.

This is highly similar to extract. The only difference is that it counts the number of matches, rather than extracting the matched content.

Notes

When there are multiple target columns, the resulting columns are inserted after each target column.

nest¶

Required arguments

Column: A list of target columns

Type: Map or Array

New column name

Description

The target columns are grouped into a new column of the given type.

Below are examples of grouping columns into an array and map, respectively.

unnest¶

Required arguments

Column: A single target column

Select elements: 0-base index for an array, or key value for a map

Description

A new column is created by extracting the selected elements from an array or a map.

Notes

The target column must be of the array or map type.

flatten¶

Required arguments

Column: A single target column

Description

Rows are created from elements of an array.

Notes

The target column must be of the array type.

If the target array column has four elements as shown in the above example, each original row of the array results in four rows. Non-array columns result in the same columns.

aggregate¶

Required arguments

Expression: A list of aggregate functions

Group by: A list of columns that group values by.

Description

A new column is added from the results of grouping by each combination of the elements from the GroupBy columns.

A column is created for each expression. For example, two columns are created if average and count are designated as expressions.

The available aggregate functions are as follows:

count()

sum(colname)

avg(colname)

min(colname)

max(colname)

Notes

Calculations are performed only for sampling results. Therefore, the snapshot?the results for the entire data?may be different.

Note that () must be inserted when using the count function.

count(colname) is currently not available.

pivot¶

Required arguments

Column: A list of columns subject to pivoting

Expression: A list of expressions whose resulting values form new columns (only aggregate functions are available)

Group by: A list of columns that group values by.

Description

Group By is performed for each combination of target columns and GroupBy columns. A dataset having the results as column values is created.

A set of columns is created for each expression. For example, if average and count are designated as expressions and the values in the pivoted columns are divided into ten groups, a total of 20 columns will be created.

Notes

This is used when performing GroupBy on at least two columns. (1 pivoted column, 1 GroupBy column)

Here, Rename multiple columns is useful as column names tend to get longer.

unpivot¶

Required arguments

Column: A list of target columns to be converted into values in new columns

GroupEvery: Number of columns (defaults to 1)

Description

Two columns are created?one contains the selected column names and the other contains their values. (If GroupEvery is set to 1)

If GroupEvery is the same as the number of selected columns, each resulting pair of columns contains the name and values of its respective original column. Therefore, If 10 columns are unpivoted with the GroupEvery argument set to 10, for example, a total of 20 columns are created.

Notes

Using the GroupEvery argument set to a factor of the number of columns will soon be supported.

<Where GroupEvery is set to 1>

<Where GroupEvery is set to the same as the number of columns>

join¶

Unlike other rules, join has a separate popup.

Required arguments (select in a popup or enter a value)

Dataset to join: A wrangled dataset in the same dataflow

Columns to join (toggle)

Join keys: Multiple values may be entered

Join type: Only inner join supported now

Description

Joins to the target dataset to create new columns.

This rule is the same as join used by a relational database.

The results can be previewed by clicking the Show result button.

Notes

The join keys must be included in the columns to join.

union¶

Similar to join, union has a separate popup.

Required arguments (select in a popup)

Datasets to union: Multiple selections allowed.

Description

The content of the selected datasets is also processed.

This rule is the same as union all used by a relational database.

Notes

The target datasets must coincide with the dataset that unions them in terms of column name, type, and number of columns.

window¶

Required arguments

Expression: A list of window functions

Group by: A list of columns that group values by. Row order created within each group. If not specified, the whole data is sorted based on the Sort by setting.

Sort by: Specifies columns by which the order of rows is determined. If not specified, data is sorted in the order of being inputted.

Description

Column values are created by calculating with the values of the preceding and following rows.

The rows are grouped first and then sorted within each group in the specified column order.

In the above example, each row value is averaged with the three preceding and following rows within the same State group.

If an immediately preceding row does not have the same state, earlier rows are searched.

The currently available window functions are as follows:

row_number()

lead(colname, int)

lag(colname, int)

rolling_sum(colname, int, int)

rolling_avg(colname, int, int)

In addition to window functions, aggregate functions may be used.

Notes

When using window functions, error messages may not be properly displayed in the event of insufficient arguments.

Rule types¶

drop¶

header¶

settype¶

setformat¶

rename¶

keep¶

delete¶

replace¶

set¶

derive¶

split¶

merge¶

extract¶

countpattern¶

nest¶

unnest¶

flatten¶

aggregate¶

pivot¶

unpivot¶

join¶

union¶

window¶