Thursday, April 18, 2013

DataStage scenario with a small example


Consider the input file.txt below

telephoneno 
09700020075 
919889110102 
918571233668

Output.txt

telephoneno 
09700020075 
09889110102 
08571233668

Requirement: the output length should be 11 characters, and if the first two characters are "91", they should be replaced with "0".

Ans:
In the Transformer stage, use the following derivation:

If (Len(InLink.Phone) = 11) Then InLink.Phone Else "0" : Right(InLink.Phone,10)
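For readers who want to check the logic outside DataStage, here is a minimal Python sketch of the same rule (the function name normalize_phone is only for illustration):

def normalize_phone(phone):
    # If the number is already 11 characters long, pass it through unchanged.
    if len(phone) == 11:
        return phone
    # Otherwise keep the rightmost 10 characters and prefix a "0",
    # which turns a "91"-prefixed 12-character number into an 11-character one.
    return "0" + phone[-10:]

for phone in ["09700020075", "919889110102", "918571233668"]:
    print(normalize_phone(phone))   # 09700020075, 09889110102, 08571233668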

Common steps for DataStage job development:

1)Understand the requirement clearly, including any exceptions

2)Form the algorithm in simple English first; do not jump into job development directly

Algorithm:
a)Check whether the length of the string is 11 or not
b)If it is 11, pass the input as-is; otherwise extract the rightmost 10 characters and prefix them with a zero

3)Now convert the algorithm into DataStage stages

a)To implement the first point of the algorithm we need the Len function and an If statement
b)To implement the second point of the algorithm we need the Right function

4)In most cases the requirement is split into multiple jobs, and after forming the algorithm we have to identify the stages used in each job; this makes debugging easier than implementing the entire requirement in one job.

To identify when to use which stage, the following link will help you: when-to-use-which-stage-in-datastage

5)Performance tuning, which is an iterative approach, comes next if you face any problems.

6)Finally, connect all the jobs through a job sequence according to their dependencies, because the output of one DataStage job may be the input of another DataStage job.

Wednesday, April 17, 2013

Difference between sequential file stage and Data set stage?


1)When you use a sequential file as a source, it is converted from ASCII to the native format at compile time, whereas no such conversion is required for datasets. Also, by default sequential files are processed in sequence only, can accommodate only up to 2 GB, and do not support NULL values. All of the above can be overcome using the Data Set stage, but the selection depends on the requirement; for example, if you want to capture rejected data you need to use the Sequential File or File Set stage.

2)The Sequential File stage is used to extract data from flat files and load data into flat files, with a 2 GB limit. A dataset is an intermediate stage; data is loaded into it in parallel, which improves performance.

3)A data set mainly consists of two kinds of files.

a)A descriptor file, which contains the metadata and the data location but not the actual data itself
b)Data files, which contain the actual data, with one file per partition.

4)The orchadmin command is used to delete datasets, whereas the Unix rm command is used to remove flat files.

Complete information about orchadmin can be found in the following link: orchadmin

Difference between server jobs and parallel jobs?


Server jobs:-
1)Handle smaller volumes of data.
2)Have fewer components.
3)Data processing is slower.
4)Executed by the DataStage server engine.
5)Compiled into BASIC.
6)No parallel capability, which is one of the drawbacks of server jobs.

Parallel jobs:-
1)Handle high volumes of data.
2)Executed by the DataStage parallel engine.
3)Have more components compared to server jobs.
4)Support pipeline and partition parallelism.
5)Compiled into OSH.

Similarity between server jobs and parallel jobs:

Both can be monitored at run time in DataStage Director.

Tuesday, April 16, 2013

COLUMN GENERATOR STAGE



1)The Column Generator stage is a development/debug stage.

2)It can have a single input link and a single output link.

3)The Column Generator adds columns to incoming data and generates mock data for these columns for each data row processed. This is useful for testing a job when no real test data is available.

ROW GENERATOR STAGE


1)The Row Generator stage is a development/debug stage. The stage has no input links and only one output link.

2)The row generator stage generates a set of mock data fitting the specified metadata. This is useful for testing a job when no real test data is available. 

3)Metadata can be specified using a schema file.
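Conceptually the stage works like the following Python sketch, which generates mock rows to fit a given column definition (the column names and types here are invented for illustration; the real stage takes them from the metadata or a schema file):

import random
import string

# Hypothetical metadata: column name -> column type
metadata = {"id": "integer", "name": "string", "amount": "decimal"}

def mock_row(meta):
    row = {}
    for column, coltype in meta.items():
        if coltype == "integer":
            row[column] = random.randint(1, 1000)
        elif coltype == "decimal":
            row[column] = round(random.uniform(0, 100), 2)
        else:
            row[column] = "".join(random.choices(string.ascii_uppercase, k=8))
    return row

# Generate 10 mock rows that fit the metadata, as the Row Generator stage does.
rows = [mock_row(metadata) for _ in range(10)]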

Sort stage

1)The Sort stage is a processing stage used to perform sorting operations on input data.

2)Need for Sorting:
Some stages require sorted input, e.g. the Join and Merge stages.

3)The Sort stage requires a ‘key’ to be specified by which the sort is performed. Multiple sort keys can be specified.

4)The sort operation is performed partition-wise. To sort a complete set of data, change the Sort stage execution mode to sequential.

5)There are two ways sorting can be performed in DataStage:

a)Within stages: on the input link Partitioning tab, set partitioning to anything other than Auto

b)In a separate Sort stage, which has more options such as allow duplicates, case sensitivity, sort order (ascending/descending), etc.

By default, both methods use the tsort operator, which can be identified in the job score.

6)Partitioning keys are often different from sorting keys.

Keyed partitioning (e.g. Hash) is used to group related records into the same partition, whereas sort keys are used to establish order within each partition, as in the sketch below.
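As a rough illustration of point 6 (plain Python, not DataStage code, with invented column names), records are hash-partitioned on a grouping key and then each partition is sorted on a different key:

# Each record: (customer_id, transaction_date, amount)
records = [
    (101, "2013-04-02", 50.0),
    (205, "2013-04-01", 20.0),
    (101, "2013-04-01", 75.0),
    (205, "2013-04-03", 10.0),
]

num_partitions = 2
partitions = [[] for _ in range(num_partitions)]

# Keyed (hash) partitioning: customer_id decides the partition,
# so all rows for the same customer land in the same partition.
for rec in records:
    partitions[hash(rec[0]) % num_partitions].append(rec)

# Sorting: within each partition, order the rows by transaction_date.
for part in partitions:
    part.sort(key=lambda rec: rec[1])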

Modify stage


1)The Modify stage is a processing stage that alters the record schema of the input data.

2)The Modify stage can have a single input and a single output link.

3)The Modify stage can also be used to handle NULL values and to perform string, date, time, and timestamp manipulation functions (a rough sketch follows this list).

4)The Modify stage is a native parallel stage and has performance benefits over the Transformer stage.
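As a loose illustration only (plain Python, not Modify stage syntax; the column names are invented), null handling and a type conversion on one record might look like this:

from datetime import datetime

def modify(record):
    # Null handling: replace a NULL (None) salary with a default value.
    salary = record["salary"] if record["salary"] is not None else 0
    # Type conversion: turn the join_date string into a real date (a schema change).
    join_date = datetime.strptime(record["join_date"], "%Y-%m-%d").date()
    return {"salary": salary, "join_date": join_date}

print(modify({"salary": None, "join_date": "2013-04-14"}))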

Sunday, April 14, 2013

Funnel stage

1)The Funnel stage is a processing stage that combines multiple inputs into a single output. It can have any number of inputs and a single output link. 

2)The metadata for all input data sets must be identical.

3)The Funnel stage can operate in three modes, and the default mode is Continuous (a small sketch of the three modes follows this list):

a)Continuous: 

1)Combines the records of the input links in no guaranteed order. 

2)It takes one record from each input link in turn.  If data is not available on an input link, the stage skips to the next link rather than waiting. 

b)Sort Funnel:
Combines the input records in the order defined by the value(s) of one or more key columns and the order of the output records is determined by these sorting keys. 

c)Sequence:  Copies all records from the first input link to the output link, then all the records from the second input link and so on.
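The three modes can be pictured with this small Python sketch (standard library only; the two lists stand in for input links and the row contents are invented):

import heapq
import itertools

link1 = [{"id": 1}, {"id": 4}]
link2 = [{"id": 2}, {"id": 3}]

# Continuous (roughly): take rows from whichever link has data, with no guaranteed order.
continuous = [row for pair in itertools.zip_longest(link1, link2)
              for row in pair if row is not None]

# Sort Funnel: combine already-sorted inputs so the output stays ordered on the key.
sort_funnel = list(heapq.merge(link1, link2, key=lambda r: r["id"]))

# Sequence: all rows of the first link, then all rows of the second link, and so on.
sequence = list(itertools.chain(link1, link2))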

How to use different configuration files for different job activities in a job sequence?


Define an APT_CONFIG_FILE environment variable at the job level and pass the path of the appropriate configuration file to each job in the Job Activities of the job sequence.



Merge stage


1)The Merge stage is a processing stage that can have any number of input links and one output link, with the same number of reject links as there are update links.

2)The input datasets to the Merge stage must be key-partitioned and sorted. This ensures that rows with the same key column values are located in the same partition and will be processed by the same node.

3)The Merge stage combines master data with the data on one or more update links where the keys match.

4)The master and update links must have duplicate-free data to ensure proper results; if the input data is not duplicate-free, the output will be incorrect.

5)Check the link ordering to make sure the master and update links are assigned correctly; otherwise the output will be incorrect. A rough sketch of the stage's behaviour follows.
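A rough Python sketch of the idea, with a key-partitioned, sorted, duplicate-free master and a single update link (column names are invented):

master = [{"id": 1, "name": "A"}, {"id": 2, "name": "B"}, {"id": 3, "name": "C"}]  # sorted, duplicate-free
updates = [{"id": 1, "city": "X"}, {"id": 3, "city": "Y"}]                         # sorted, duplicate-free

update_by_key = {row["id"]: row for row in updates}

merged = []
for m in master:
    u = update_by_key.pop(m["id"], None)
    # The master row is combined with the matching update row where the keys match.
    merged.append({**m, **u} if u else dict(m))

# Update rows that matched no master key would go to the corresponding reject link.
rejected_updates = list(update_by_key.values())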

Lookup Stage


1)The Lookup stage is a processing stage that can have more than one input link and one output link, as well as one reject link.

2)Lookup Failure options 
Continue, Drop, Fail, Reject 

3)If the lookup fails to find a matching key column, one of these actions can be taken: 

Fail: the Lookup stage reports an error and the job fails immediately. This is the default.

Drop: The input row with the failed lookup(s) is dropped.

Continue: The input row is transferred to the output, together with the successful table entries. The failed table entries are not transferred, resulting in either default or null output values depending on the data type.

Reject: The input row with the failed lookup(s) is transferred to a second output link, the reject link. (A small sketch of these options appears at the end of this section.)

4)A sparse lookup can be used when the input data is smaller than the reference data.

5)Joins have better performance when the reference data is huge. Avoid lookups in such cases. 

6)The Lookup stage does not need sorted input data, whereas the Join and Merge stages require sorted input data.

7)Please find the following link for the difference between normal lookup and sparse lookup: NormalvsSparse
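Here is a minimal Python sketch of the four lookup-failure actions, using an in-memory dictionary as the reference data (purely illustrative; the names reference, stream and do_lookup are not DataStage objects):

reference = {1: "Alice", 3: "Carol"}          # reference (lookup) data keyed on id
stream = [{"id": 1}, {"id": 2}, {"id": 3}]    # input (stream) rows

def do_lookup(rows, action="continue"):
    output, rejects = [], []
    for row in rows:
        name = reference.get(row["id"])
        if name is not None:
            output.append({**row, "name": name})
        elif action == "fail":
            raise RuntimeError("lookup failed for id %s" % row["id"])
        elif action == "drop":
            continue                              # the row is silently dropped
        elif action == "reject":
            rejects.append(row)                   # the row goes to the reject link
        else:                                     # "continue"
            output.append({**row, "name": None})  # null/default for the missed columns
    return output, rejects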

Join Stage


1)The Join stage is a processing stage that performs a join operation on two or more input data sets and then provides output in the form of one resultant data set.

2)The following four types of joins can be performed: 
•  Inner join 
•  Left outer join 
•  Right outer join 
•  Full outer join 

3)The Join stage supports two or more sorted input links and one output link.

4)The Join stage editor allows you to specify the keys on which the join is performed. More than one key can be specified, and the specified keys should have the same name on all links.

5)No fail/reject option for missed matches 

6)Link ordering is very important when using a left or right outer join, and the input data on all links to the Join stage should be sorted.

Capturing unmatched records from a Join:

a)The Join stage does not provide reject handling for unmatched records. If unmatched rows must be captured, an OUTER join operation must be performed, so that when a match does not occur the Join stage inserts null values into the unmatched non-key columns, provided the non-key columns are defined as nullable on the Join input links.

b)After the Join stage, use a Transformer to filter the null records with the help of the IsNull built-in function, as in the sketch below.
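A small Python sketch of that technique (a left outer join followed by filtering on the null column; the column names are invented):

left = [{"id": 1, "name": "A"}, {"id": 2, "name": "B"}]
right = {1: {"id": 1, "amount": 10}}           # right input keyed on id, duplicate-free here

joined = []
for row in left:
    match = right.get(row["id"])
    # Left outer join: unmatched non-key columns come through as None (null).
    joined.append({**row, "amount": match["amount"] if match else None})

# Equivalent of the Transformer with IsNull(): keep only the unmatched rows.
unmatched = [row for row in joined if row["amount"] is None]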

Filter stage

1)The Filter stage is used to transfer (not modify) records from the input link based on a specified condition.

2)Different criteria/requirements can be specified to direct data to different output links from the Filter stage.

3)This processing stage can have a single input link and any number of output links. It can also, optionally, have a reject link.

4)The Switch stage supports only up to 128 output links, whereas the Filter stage can have any number of output links. A rough sketch of the Filter stage's behaviour follows.
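Conceptually (plain Python with invented conditions, not Filter stage syntax), rows are routed unmodified to the output links whose conditions they satisfy, and rows that satisfy none go to the reject link:

rows = [
    {"dept": "HR", "salary": 40000},
    {"dept": "IT", "salary": 90000},
    {"dept": "", "salary": 10},
]

output_hr, output_high_pay, reject = [], [], []
for row in rows:
    routed = False
    if row["dept"] == "HR":        # condition for output link 0, like a WHERE clause
        output_hr.append(row)
        routed = True
    if row["salary"] > 50000:      # condition for output link 1
        output_high_pay.append(row)
        routed = True
    if not routed:                 # no condition matched: send the row to the reject link
        reject.append(row)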

Friday, April 12, 2013

When to use which stage in Datastage?

Copy STAGE-To drop a particular column

Sort STAGE-Sorting, generating key change columns; similar to the ORDER BY clause in Oracle

Filter STAGE-Similar to the WHERE clause in Oracle, but it cannot perform join operations

Lookup,Join,Merge-To perform join operations

Pivot Enterprise STAGE-Rows to columns and columns to rows

External Filter STAGE-Filter the records by using Unix filter commands like grep, etc.

MODIFY STAGE-Metadata conversion and null handling; similar to conversion functions in Oracle

FUNNEL STAGE-Combining multiple inputs into a single output. Metadata should be the same for all the inputs

REMOVE DUPLICATES STAGE-To remove duplicate values from a single sorted input

ENCODE / DECODE STAGES-To encode/compress a data set using UNIX encoding commands like gzip, etc.

TRANSFORMER STAGE:
a)Filtering the data (constraints)
b)Metadata conversion (using functions)
c)Rows to columns and columns to rows (using stage variables)
d)Looping
e)Creating a counter (using macros)-Counter using Transformer

SURROGATE KEY GENERATOR STAGE-To generate surrogate keys, similar to an Oracle database sequence

Aggregator STAGE-To perform group-by operations like max, min, etc.; similar to the GROUP BY clause in Oracle

ROW GENERATOR STAGE-To generate a set of mock data fitting the specified metadata when no real data is available

XML OUTPUT STAGE-To convert tabular data such as tables and sequential files to XML hierarchical structures

SWITCH STAGE-Performs an operation similar to the switch statement in C; used to filter the data

CHANGE CAPTURE STAGE-To identify delta changes (inserts, updates, deletes, etc.) between two sources

Oracle Connector-To connect to the Oracle database

Aggregator stage

1)The Aggregator stage is a processing stage that can have one input and one output link. It classifies records from the input link into groups and computes totals or performs specified aggregation functions for each group.

2)Records can be grouped on one or more keys

3)In a parallel environment, we need to be careful when partitioning, since it can affect the result of the Aggregator. If records that fall in the same group are in different partitions, the generated output will be wrong. Therefore, it is better to hash-partition on the grouping keys before the Aggregator stage so that records with the same keys go to the same partition (see the sketch after this list).

4)In the Aggregator stage two grouping methods (hash and sort) are available. Please find the following link for more information about grouping methods in the Aggregator stage: Grouping Methods
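A small Python sketch of point 3: when rows are hash-partitioned on the grouping key, no group is split across partitions, so each partition can total its own groups and the combined result is still correct (column names are invented):

from collections import defaultdict

rows = [("HR", 100), ("IT", 200), ("HR", 50), ("IT", 25)]
num_partitions = 2

# Hash partition on the grouping key so all rows of a group share one partition.
partitions = [[] for _ in range(num_partitions)]
for dept, amount in rows:
    partitions[hash(dept) % num_partitions].append((dept, amount))

# Each partition aggregates its own groups independently.
totals = defaultdict(int)
for part in partitions:
    for dept, amount in part:
        totals[dept] += amount
# totals == {"HR": 150, "IT": 225}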

Copy Stage


1)The Copy stage is a processing stage that can have a single input and any number of output links.

2)The Copy stage is used to copy a single input data set to a number of output data sets.

3)This stage is generally used for the following things:

a)Columns can be dropped. 
b)The order of the columns can be altered.