Thursday, April 18, 2013

DataStage scenario with a small example


Consider the input file.txt below

telephoneno 
09700020075 
919889110102 
918571233668

Output.txt

telephoneno 
09700020075 
09889110102 
08571233668

Requirement: the output length should be 11 characters, and if the first two characters are "91", they should be replaced with "0".

Ans:
In the Transformer stage, use the following derivation:

If (Len(InLink.Phone) = 11) Then InLink.Phone Else "0" : Right(InLink.Phone,10)
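For readers who want to check the logic outside DataStage, here is a minimal Python sketch of the same rule (the function name normalize_phone is only for illustration):

def normalize_phone(phone):
    # If the number is already 11 characters long, pass it through unchanged.
    if len(phone) == 11:
        return phone
    # Otherwise keep the rightmost 10 characters and prefix a "0",
    # which turns a "91"-prefixed 12-character number into an 11-character one.
    return "0" + phone[-10:]

for phone in ["09700020075", "919889110102", "918571233668"]:
    print(normalize_phone(phone))   # 09700020075, 09889110102, 08571233668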

Common steps for DataStage job development:

1)Understand the requirement clearly, including any exceptions

2)Form the algorithm in simple English first; do not jump into job development directly

Algorithm:
a)Check whether the length of the string is 11 or not
b)If it is 11, pass the input as-is; otherwise extract the rightmost 10 characters and prefix them with a zero

3)Now convert the algorithm into DataStage stages

a)To implement the first point of the algorithm we need the Len function and an If statement
b)To implement the second point of the algorithm we need the Right function

4)In most cases the requirement is split into multiple jobs, and after forming the algorithm we have to identify the stages used in each job; this makes debugging easier than implementing the entire requirement in one job.

To identify when to use which stage, the following link will help you: when-to-use-which-stage-in-datastage

5)Performance tuning, which is an iterative approach, comes next if you face any problems.

6)Finally, connect all the jobs through a job sequence according to their dependencies, because the output of one DataStage job may be the input of another DataStage job.

Wednesday, April 17, 2013

Difference between sequential file stage and Data set stage?


1)When you use a sequential file as a source, it is converted from ASCII to the native format at compile time, whereas no such conversion is required for datasets. Also, by default sequential files are processed in sequence only, can accommodate only up to 2 GB, and do not support NULL values. All of the above can be overcome using the Data Set stage, but the selection depends on the requirement; for example, if you want to capture rejected data you need to use the Sequential File or File Set stage.

2)The Sequential File stage is used to extract data from flat files and load data into flat files, with a 2 GB limit. A dataset is an intermediate stage; data is loaded into it in parallel, which improves performance.

3)A data set mainly consists of two kinds of files.

a)A descriptor file, which contains the metadata and the data location but not the actual data itself
b)Data files, which contain the actual data, with one file per partition.

4)The orchadmin command is used to delete datasets, whereas the Unix rm command is used to remove flat files.

Complete information about orchadmin can be found in the following link: orchadmin

Difference between server jobs and parallel jobs?


Server jobs:-
1)Handle smaller volumes of data.
2)Have fewer components.
3)Data processing is slower.
4)Executed by the DataStage server engine.
5)Compiled into BASIC.
6)No parallel capability, which is one of the drawbacks of server jobs.

Parallel jobs:-
1)Handle high volumes of data.
2)Executed by the DataStage parallel engine.
3)Have more components compared to server jobs.
4)Support pipeline and partition parallelism.
5)Compiled into OSH.

Similarity between server jobs and parallel jobs:

Both can be monitored at run time in DataStage Director.

Tuesday, April 16, 2013

COLUMN GENERATOR STAGE



1)The Column Generator stage is a development/debug stage.

2)It can have a single input link and a single output link.

3)The Column Generator adds columns to incoming data and generates mock data for these columns for each data row processed. This is useful for testing a job when no real test data is available.

ROW GENERATOR STAGE


1)The Row Generator stage is a development/debug stage. The stage has no input links and only one output link.

2)The row generator stage generates a set of mock data fitting the specified metadata. This is useful for testing a job when no real test data is available. 

3)Metadata can be specified using a schema file.
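Conceptually the stage works like the following Python sketch, which generates mock rows to fit a given column definition (the column names and types here are invented for illustration; the real stage takes them from the metadata or a schema file):

import random
import string

# Hypothetical metadata: column name -> column type
metadata = {"id": "integer", "name": "string", "amount": "decimal"}

def mock_row(meta):
    row = {}
    for column, coltype in meta.items():
        if coltype == "integer":
            row[column] = random.randint(1, 1000)
        elif coltype == "decimal":
            row[column] = round(random.uniform(0, 100), 2)
        else:
            row[column] = "".join(random.choices(string.ascii_uppercase, k=8))
    return row

# Generate 10 mock rows that fit the metadata, as the Row Generator stage does.
rows = [mock_row(metadata) for _ in range(10)]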

Sort stage

1)The Sort stage is a processing stage used to perform sorting operations on input data.

2)Need for Sorting:
Some stages require sorted input, e.g. the Join and Merge stages.

3)The Sort stage requires a ‘key’ to be specified by which the sort is performed. Multiple sort keys can be specified.

4)The sort operation is performed partition-wise. To sort a complete set of data, change the Sort stage execution mode to sequential.

5)There are two ways sorting can be performed in DataStage:

a)Within stages: on the input link Partitioning tab, set partitioning to anything other than Auto

b)In a separate Sort stage, which has more options such as allow duplicates, case sensitivity, sort order (ascending/descending), etc.

By default, both methods use the tsort operator, which can be identified in the job score.

6)Partitioning keys are often different from sorting keys.

Keyed partitioning (e.g. Hash) is used to group related records into the same partition, whereas sort keys are used to establish order within each partition, as in the sketch below.
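As a rough illustration of point 6 (plain Python, not DataStage code, with invented column names), records are hash-partitioned on a grouping key and then each partition is sorted on a different key:

# Each record: (customer_id, transaction_date, amount)
records = [
    (101, "2013-04-02", 50.0),
    (205, "2013-04-01", 20.0),
    (101, "2013-04-01", 75.0),
    (205, "2013-04-03", 10.0),
]

num_partitions = 2
partitions = [[] for _ in range(num_partitions)]

# Keyed (hash) partitioning: customer_id decides the partition,
# so all rows for the same customer land in the same partition.
for rec in records:
    partitions[hash(rec[0]) % num_partitions].append(rec)

# Sorting: within each partition, order the rows by transaction_date.
for part in partitions:
    part.sort(key=lambda rec: rec[1])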

Modify stage


1)The Modify stage is a processing stage that alters the record schema of the input data.

2)The Modify stage can have a single input and a single output link.

3)The Modify stage can also be used to handle NULL values and to perform string, date, time, and timestamp manipulation functions (a rough sketch follows this list).

4)The Modify stage is a native parallel stage and has performance benefits over the Transformer stage.
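As a loose illustration only (plain Python, not Modify stage syntax; the column names are invented), null handling and a type conversion on one record might look like this:

from datetime import datetime

def modify(record):
    # Null handling: replace a NULL (None) salary with a default value.
    salary = record["salary"] if record["salary"] is not None else 0
    # Type conversion: turn the join_date string into a real date (a schema change).
    join_date = datetime.strptime(record["join_date"], "%Y-%m-%d").date()
    return {"salary": salary, "join_date": join_date}

print(modify({"salary": None, "join_date": "2013-04-14"}))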

Sunday, April 14, 2013

Funnel stage

1)The Funnel stage is a processing stage that combines multiple inputs into a single output. It can have any number of inputs and a single output link. 

2)The metadata for all input data sets must be identical.

3)The Funnel stage can operate in three modes, and the default mode is Continuous (a small sketch of the three modes follows this list):

a)Continuous: 

1)Combines the records of the input links in no guaranteed order. 

2)It takes one record from each input link in turn.  If data is not available on an input link, the stage skips to the next link rather than waiting. 

b)Sort Funnel:
Combines the input records in the order defined by the value(s) of one or more key columns and the order of the output records is determined by these sorting keys. 

c)Sequence:  Copies all records from the first input link to the output link, then all the records from the second input link and so on.
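The three modes can be pictured with this small Python sketch (standard library only; the two lists stand in for input links and the row contents are invented):

import heapq
import itertools

link1 = [{"id": 1}, {"id": 4}]
link2 = [{"id": 2}, {"id": 3}]

# Continuous (roughly): take rows from whichever link has data, with no guaranteed order.
continuous = [row for pair in itertools.zip_longest(link1, link2)
              for row in pair if row is not None]

# Sort Funnel: combine already-sorted inputs so the output stays ordered on the key.
sort_funnel = list(heapq.merge(link1, link2, key=lambda r: r["id"]))

# Sequence: all rows of the first link, then all rows of the second link, and so on.
sequence = list(itertools.chain(link1, link2))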

How to use different configuration files for different job activities in a job sequence?


Define an APT_CONFIG_FILE environment variable at the job level and pass the path of the appropriate configuration file to each job in the Job Activities of the job sequence.



Merge stage


1)The Merge stage is a processing stage that can have any number of input links and one output link, with the same number of reject links as there are update links.

2)The input datasets to the Merge stage must be key-partitioned and sorted. This ensures that rows with the same key column values are located in the same partition and will be processed by the same node.

3)The Merge stage combines master data with the data on one or more update links where the keys match.

4)The master and update links must have duplicate-free data to ensure proper results; if the input data is not duplicate-free, the output will be incorrect.

5)Check the link ordering to make sure the master and update links are assigned correctly; otherwise the output will be incorrect. A rough sketch of the stage's behaviour follows.
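A rough Python sketch of the idea, with a key-partitioned, sorted, duplicate-free master and a single update link (column names are invented):

master = [{"id": 1, "name": "A"}, {"id": 2, "name": "B"}, {"id": 3, "name": "C"}]  # sorted, duplicate-free
updates = [{"id": 1, "city": "X"}, {"id": 3, "city": "Y"}]                         # sorted, duplicate-free

update_by_key = {row["id"]: row for row in updates}

merged = []
for m in master:
    u = update_by_key.pop(m["id"], None)
    # The master row is combined with the matching update row where the keys match.
    merged.append({**m, **u} if u else dict(m))

# Update rows that matched no master key would go to the corresponding reject link.
rejected_updates = list(update_by_key.values())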

Lookup Stage


1)The Lookup stage is a processing stage that can have more than one input link and one output link, as well as one reject link.

2)Lookup Failure options 
Continue, Drop, Fail, Reject 

3)If the lookup fails to find a matching key column, one of these actions can be taken: 

Fail: the Lookup stage reports an error and the job fails immediately. This is the default.

Drop: The input row with the failed lookup(s) is dropped.

Continue: The input row is transferred to the output, together with the successful table entries. The failed table entries are not transferred, resulting in either default or null output values depending on the data type.

Reject: The input row with the failed lookup(s) is transferred to a second output link, the reject link. (A small sketch of these options appears at the end of this section.)

4)A sparse lookup can be used when the input data is smaller than the reference data.

5)Joins have better performance when the reference data is huge. Avoid lookups in such cases. 

6)The Lookup stage does not need sorted input data, whereas the Join and Merge stages require sorted input data.

7)Please find the following link for the difference between normal lookup and sparse lookup: NormalvsSparse
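Here is a minimal Python sketch of the four lookup-failure actions, using an in-memory dictionary as the reference data (purely illustrative; the names reference, stream and do_lookup are not DataStage objects):

reference = {1: "Alice", 3: "Carol"}          # reference (lookup) data keyed on id
stream = [{"id": 1}, {"id": 2}, {"id": 3}]    # input (stream) rows

def do_lookup(rows, action="continue"):
    output, rejects = [], []
    for row in rows:
        name = reference.get(row["id"])
        if name is not None:
            output.append({**row, "name": name})
        elif action == "fail":
            raise RuntimeError("lookup failed for id %s" % row["id"])
        elif action == "drop":
            continue                              # the row is silently dropped
        elif action == "reject":
            rejects.append(row)                   # the row goes to the reject link
        else:                                     # "continue"
            output.append({**row, "name": None})  # null/default for the missed columns
    return output, rejects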

Join Stage


1)The Join stage is a processing stage that performs a join operation on two or more input data sets and then provides output in the form of one resultant data set.

2)The following four types of joins can be performed: 
•  Inner join 
•  Left outer join 
•  Right outer join 
•  Full outer join 

3)The Join stage supports two or more sorted input links and one output link.

4)The Join stage editor allows you to specify the keys on which the join is performed. More than one key can be specified, and the specified keys should have the same name on all links.

5)No fail/reject option for missed matches 

6)Link ordering is very important when using a left or right outer join, and the input data on all links to the Join stage should be sorted.

Capturing unmatched records from a Join:

a)The Join stage does not provide reject handling for unmatched records. If unmatched rows must be captured, an OUTER join operation must be performed, so that when a match does not occur the Join stage inserts null values into the unmatched non-key columns, provided the non-key columns are defined as nullable on the Join input links.

b)After the Join stage, use a Transformer to filter the null records with the help of the IsNull built-in function, as in the sketch below.
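A small Python sketch of that technique (a left outer join followed by filtering on the null column; the column names are invented):

left = [{"id": 1, "name": "A"}, {"id": 2, "name": "B"}]
right = {1: {"id": 1, "amount": 10}}           # right input keyed on id, duplicate-free here

joined = []
for row in left:
    match = right.get(row["id"])
    # Left outer join: unmatched non-key columns come through as None (null).
    joined.append({**row, "amount": match["amount"] if match else None})

# Equivalent of the Transformer with IsNull(): keep only the unmatched rows.
unmatched = [row for row in joined if row["amount"] is None]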

Filter stage

1)The Filter stage is used to transfer (not modify) records from the input link based on a specified condition.

2)Different criteria/requirements can be specified to direct data to different output links from the Filter stage.

3)This processing stage can have a single input link and any number of output links. It can also, optionally, have a reject link.

4)The Switch stage supports only up to 128 output links, whereas the Filter stage can have any number of output links. A rough sketch of the Filter stage's behaviour follows.
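Conceptually (plain Python with invented conditions, not Filter stage syntax), rows are routed unmodified to the output links whose conditions they satisfy, and rows that satisfy none go to the reject link:

rows = [
    {"dept": "HR", "salary": 40000},
    {"dept": "IT", "salary": 90000},
    {"dept": "", "salary": 10},
]

output_hr, output_high_pay, reject = [], [], []
for row in rows:
    routed = False
    if row["dept"] == "HR":        # condition for output link 0, like a WHERE clause
        output_hr.append(row)
        routed = True
    if row["salary"] > 50000:      # condition for output link 1
        output_high_pay.append(row)
        routed = True
    if not routed:                 # no condition matched: send the row to the reject link
        reject.append(row)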

Friday, April 12, 2013

When to use which stage in Datastage?

Copy STAGE-To drop a particular column

Sort STAGE-Sorting, generating key change columns; similar to the ORDER BY clause in Oracle

Filter STAGE-Similar to the WHERE clause in Oracle, but it cannot perform join operations

Lookup,Join,Merge-To perform join operations

Pivot Enterprise STAGE-Rows to columns and columns to rows

External Filter STAGE-Filter the records by using Unix filter commands like grep, etc.

MODIFY STAGE-Metadata conversion and null handling; similar to conversion functions in Oracle

FUNNEL STAGE-Combining multiple inputs into a single output. Metadata should be the same for all the inputs

REMOVE DUPLICATES STAGE-To remove duplicate values from a single sorted input

ENCODE / DECODE STAGES-To encode/compress a data set using UNIX encoding commands like gzip, etc.

TRANSFORMER STAGE:
a)Filtering the data (constraints)
b)Metadata conversion (using functions)
c)Rows to columns and columns to rows (using stage variables)
d)Looping
e)Creating a counter (using macros)-Counter using Transformer

SURROGATE KEY GENERATOR STAGE-To generate surrogate keys, similar to an Oracle database sequence

Aggregator STAGE-To perform group-by operations like max, min, etc.; similar to the GROUP BY clause in Oracle

ROW GENERATOR STAGE-To generate a set of mock data fitting the specified metadata when no real data is available

XML OUTPUT STAGE-To convert tabular data such as tables and sequential files to XML hierarchical structures

SWITCH STAGE-Performs an operation similar to the switch statement in C; used to filter the data

CHANGE CAPTURE STAGE-To identify delta changes (inserts, updates, deletes, etc.) between two sources

Oracle Connector-To connect to the Oracle database

Aggregator stage

1)The Aggregator stage is a processing stage that can have one input and one output link. It classifies records from the input link into groups and computes totals or performs specified aggregation functions for each group.

2)Records can be grouped on one or more keys

3)In a parallel environment, we need to be careful when partitioning, since it can affect the result of the Aggregator. If records that fall in the same group are in different partitions, the generated output will be wrong. Therefore, it is better to hash-partition on the grouping keys before the Aggregator stage so that records with the same keys go to the same partition (see the sketch after this list).

4)In the Aggregator stage two grouping methods (hash and sort) are available. Please find the following link for more information about grouping methods in the Aggregator stage: Grouping Methods
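A small Python sketch of point 3: when rows are hash-partitioned on the grouping key, no group is split across partitions, so each partition can total its own groups and the combined result is still correct (column names are invented):

from collections import defaultdict

rows = [("HR", 100), ("IT", 200), ("HR", 50), ("IT", 25)]
num_partitions = 2

# Hash partition on the grouping key so all rows of a group share one partition.
partitions = [[] for _ in range(num_partitions)]
for dept, amount in rows:
    partitions[hash(dept) % num_partitions].append((dept, amount))

# Each partition aggregates its own groups independently.
totals = defaultdict(int)
for part in partitions:
    for dept, amount in part:
        totals[dept] += amount
# totals == {"HR": 150, "IT": 225}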

Copy Stage


1)The Copy stage is a processing stage that can have a single input and any number of output links.

2)The Copy stage is used to copy a single input data set to a number of output data sets.

3)This stage is generally used for the following things:

a)Columns can be dropped. 
b)The order of the columns can be altered.