
Wednesday, April 17, 2013

Difference between sequential file stage and Data set stage?


1) When you use a Sequential File stage as a source, the data has to be converted from ASCII to the native DataStage format, whereas no conversion is required when you use Data Sets. Also, by default sequential files are processed sequentially, a sequential file can accommodate only up to 2 GB, and sequential files do not support NULL values. All of the above can be overcome using the Data Set stage, but the selection depends on the requirement; for example, if you want to capture rejected data, you need to use a Sequential File or File Set stage.

2)The Sequential File stage is used to extract data from flat files and load data into flat files, with a 2 GB limit. A Data Set is an intermediate stage; data is loaded into a Data Set in parallel, which improves performance.

3)A Data Set mainly consists of two kinds of files:

a)A descriptor file, which contains the metadata and the data location, but not the actual data itself.
b)Data files, which contain the actual data spread across multiple files, one file per partition.

4)The orchadmin command is used to delete Data Sets, whereas the UNIX rm command is used to remove flat files.
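For example (illustrative names - customer.ds and emp.txt are made-up files; the exact subcommand may be rm, del or delete depending on the version, so check orchadmin help on your installation):

orchadmin rm customer.ds      (removes the descriptor file and all the underlying data files on every node)
rm emp.txt                    (a flat file can simply be removed with the UNIX rm command)

Removing only the descriptor file with rm would leave the data files orphaned on the processing nodes, which is why orchadmin should be used for Data Sets.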

Complete information about orchadmin can be found in the link below - orchadmin

Tuesday, April 16, 2013

COLUMN GENERATOR STAGE



1)The Column Generator stage is a development/debug stage.

2)It can have a single input link and a single output link.

3)The Column Generator stage adds columns to incoming data and generates mock data for these columns for each data row processed. This is useful for testing a job when no real test data is available.

ROW GENERATOR STAGE


1)The Row Generator stage is a development/debug stage. The stage has no input links and only one output link.

2)The row generator stage generates a set of mock data fitting the specified metadata. This is useful for testing a job when no real test data is available. 

3)Metadata can be specified using a schema file.
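A minimal sketch of such a schema file (the column names are only illustrative):

record (
  emp_id:int32;
  emp_name:string[max=30];
  hire_date:date;
)

When this file is referenced in the Row Generator stage properties, mock rows matching these three columns are generated.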

Sort stage

1)The Sort stage is a processing stage used to perform sorting operations on input data.

2)Need for sorting:
Some stages require sorted input,
e.g. the Join and Merge stages.

3)The Sort stage requires a ‘key’ to be specified by which the sort is performed. Multiple sort keys can be specified.

4)The sort operation is performed partition-wise. To sort a complete set of data, you should change the Sort stage execution mode to Sequential.

5)There are two ways a sort can be performed in DataStage:

a)Within a stage: on the input link's Partitioning tab, set the partitioning to anything other than Auto and enable the sort option.

b)In a separate Sort stage, which has more options such as Allow duplicates, case sensitivity, sort order (ascending/descending), etc.

By default, both methods use the tsort operator, which can be identified in the job score.

6)Partitioning keys are often different from sorting keys.

Keyed partitioning (e.g. Hash) is used to group related records into the same partition, whereas sort keys are used to establish order within each partition.
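A small illustration (column names are made up): to process all transactions of a customer together and in date order, you might configure a downstream stage's input link as

Partitioning : Hash on CUSTOMER_ID
Sort keys    : CUSTOMER_ID, TRANSACTION_DATE (ascending)

Hash partitioning guarantees that every row for a given CUSTOMER_ID lands in the same partition, and the sort keys then order the rows within that partition.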

Modify stage


1)The Modify stage is a processing stage that alters the record schema of the input data.

2)The Modify stage can have a single input and a single output link.

3)The Modify stage can also be used to handle NULL values and to perform string, date, time and timestamp conversions.

4)The Modify stage is a native parallel stage and has performance benefits over the Transformer stage.
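A small sketch of Modify stage specifications (the column names are illustrative, and the conversion functions shown are the standard modify conversions - verify the exact names and date-format syntax against your DataStage version):

EMP_NAME = handle_null(EMP_NAME, 'UNKNOWN')
HIRE_DATE:date = date_from_string[%yyyy-%mm-%dd](HIRE_DATE_STR)

The first specification replaces NULLs in EMP_NAME with the string 'UNKNOWN'; the second converts a string column into a date column.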

Sunday, April 14, 2013

Funnel stage

1)The Funnel stage is a processing stage that combines multiple inputs into a single output. It can have any number of inputs and a single output link. 

2)The metadata for all input data sets must be identical.

3)The Funnel stage can operate in three modes, and the default mode is Continuous:

a)Continuous: 

1)Combines the records of the input link in no guaranteed order. 

2)It takes one record from each input link in turn.  If data is not available on an input link, the stage skips to the next link rather than waiting. 

b)Sort Funnel:
Combines the input records in the order defined by the value(s) of one or more key columns and the order of the output records is determined by these sorting keys. 

c)Sequence:  Copies all records from the first input link to the output link, then all the records from the second input link and so on.

Merge stage


1)The Merge stage is a processing stage which can have any number of input links and one output link, with the same number of reject links as there are update links.

2)The input data sets to the Merge stage must be key-partitioned and sorted. This ensures that rows with the same key column values are located in the same partition and will be processed by the same node.

3)The Merge stage combines master data with the data on one or more update links where the keys match.

4)Master and update links must have duplicate-free data to ensure proper results. If the input data is not duplicate-free, the output generated will be incorrect.

5)Check the link ordering to make sure the master and update links are assigned correctly; otherwise the output generated will be incorrect.

Lookup Stage


1)The Lookup stage is a processing stage that can have more than one input link and one output link, as well as one reject link.

2)Lookup Failure options 
Continue, Drop, Fail, Reject 

3)If the lookup fails to find a matching key column, one of these actions can be taken: 

Fail: the Lookup stage reports an error and the job fails immediately.
This is the default.

Drop: the input row with the failed lookup(s) is dropped.

Continue: the input row is transferred to the output, together with the successful table entries. The failed table entries are not transferred, resulting in either default or NULL output values, depending on the column definition.

Reject: the input row with the failed lookup(s) is transferred to a second output link, the reject link.

4)Sparse lookup can be used if the input data is smaller than the reference data.

5)Joins have better performance when the reference data is huge. Avoid lookups in such cases. 

6)The Lookup stage does not need sorted input data, whereas the Join and Merge stages require sorted input data.

7)Please see the link below for the difference between Normal lookup and Sparse lookup - NormalvsSparse

Join Stage


1)The ‘Join’ stage is a processing stage that performs a join operation on two or more input data sets and then provides output in the form of one resultant data set.

2)The following four types of joins can be performed: 
•  Inner join 
•  Left outer join 
•  Right outer join 
•  Full outer join 

3)The Join stage supports 2 or more sorted input links and 1 output link.

4)The Join stage editor allows you to specify the keys on which the join is performed. More than one key can be specified. The specified keys should have the same name on all links.

5)There is no fail/reject option for missed matches.

6)Link ordering is very important when using a left or right outer join, and the input data on all links to the Join stage should be sorted.

Capturing unmatched records from a Join:
a)The Join stage does not provide reject handling for unmatched records. If unmatched rows must be captured, an outer join operation must be performed, so that when a match does not occur the Join stage inserts NULL values into the unmatched non-key columns, provided the non-key columns are defined as nullable on the Join input links.

b)After the Join stage, use a Transformer to filter the NULL records with the help of the IsNull built-in function.
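For example (link and column names are made up), after a left outer join on DEPT_NO where DEPT_NAME comes from the right input and is defined as nullable, the Transformer could use constraints such as:

Matched output link constraint   : Not(IsNull(lnk_join.DEPT_NAME))
Unmatched output link constraint : IsNull(lnk_join.DEPT_NAME)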

Filter stage

1)The Filter stage is used to transfer, unmodified, the records from the input link that satisfy a specified condition.

2)Different criteria/requirements can be specified to direct data to different output links from the Filter stage.

3)This processing stage can have a single input link and any number of output links. It can also, optionally, have a reject link.

4)The Switch stage supports only 128 output links, whereas the Filter stage can have any number of output links.
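A small illustration of the Filter stage properties (the column name and values are made up):

Where Clause : DEPTNO = 10     Output Link : 0
Where Clause : DEPTNO = 20     Output Link : 1

Rows that satisfy neither clause are dropped, or sent to the reject link if one is defined.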

Friday, April 12, 2013

When to use which stage in Datastage?

Copy STAGE-To drop a Particular column

Sort STAGE-Sorting, generating key change columns; similar to the ORDER BY clause in Oracle

Filter STAGE-Similar to the WHERE clause in Oracle, but we cannot perform a join operation

Lookup,Join,Merge-To perform Join operation

Pivot Enterprise STAGE-Rows to columns and columns to Rows

External Filter STAGE-Filter the records by using Unix filter commands like Grep etc

MODIFY STAGE-Metadata conversion and NULL handling; similar to conversion functions in Oracle

FUNNEL STAGE-Combining multiple inputs into a single output. Metadata should be the same for all the inputs

REMOVE DUPLICATES STAGE-To remove duplicate values from a single sorted input.

ENCODE / DECODE STAGES:To encode/compress a data set using UNIX encoding  commands like gzip etc

TRANSFORMER STAGE:
a)Filtering the data (constraints)
b)Metadata conversion (using functions)
c)Rows to columns and columns to rows (using stage variables)
d)Looping
e)Creating a counter (using macros) - Counter using Transformer; a small sketch is shown at the end of this list

SURROGATE KEY GENERATOR STAGE-To generate surrogate keys, similar to an Oracle database sequence

Aggregator Stage:To perform group-by operations like MAX, MIN, etc.; similar to the GROUP BY clause in Oracle

ROW GENERATOR STAGE:To generate a set of mock data fitting the specified metadata when no real data is available

XML OUTPUT STAGE-To convert tabular data such as tables and sequential files to XML hierarchical structures.

SWITCH STAGE-It performs an operation similar to the switch statement in C, and is used to filter data

CHANGE CAPTURE STAGE-To identify Delta changes(inserts,updates,deletes etc) between two sources

oracle connector-To connect to the oracle Database.
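As mentioned for the Transformer stage above (item e), a counter is commonly built from the Transformer system variables @INROWNUM, @PARTITIONNUM and @NUMPARTITIONS. This is only a sketch - the stage variable name svCounter is made up - but the formula is a widely used pattern for generating numbers that are unique across all partitions:

svCounter = @PARTITIONNUM + (@INROWNUM - 1) * @NUMPARTITIONS + 1

With 2 partitions, partition 0 produces 1, 3, 5, ... and partition 1 produces 2, 4, 6, ..., so no two rows receive the same value.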

Aggregator stage

1)The Aggregator stage is a processing stage that can have one input and one output link. It classifies records from the input link into groups and computes totals or performs other specified aggregation functions for each group.

2)Records can be grouped on one or more keys.

3)In a parallel environment, we need to be careful when partitioning, as it can affect the result of the Aggregator. If records that fall in the same group are in different partitions, the generated output will be wrong. Therefore, it is better to hash-partition on the grouping keys before the Aggregator stage so that records with the same keys go to the same partition.

4)The Aggregator stage provides two grouping methods (Hash and Sort). Please see the following link - Grouping Methods - for more information about the grouping methods in the Aggregator stage.

Copy Stage


1)The Copy stage is a processing stage that can have a single input and any number of output links.

2)The Copy stage is used to copy a single input data set to a number of output data sets.

3)This stage is generally used for the following things:

a)Columns can be dropped. 
b)The order of the columns can be altered.

Saturday, February 9, 2013

Difference between Normal lookup and Sparse lookup?

1)The first input link to the Lookup stage is called the ‘Primary’ link. Other links are called ‘Lookup’ links. When lookup links are from a stage that is other than a database stage, all data from the lookup link is read into memory. Then, for each row from the primary link, the lookup is performed. If the source of lookups is a database, there can be two types of lookups:

Normal lookup:
All the data from the database is read into memory, and then the lookup is performed.

Sparse lookup: For each incoming row from the primary link, the SQL is fired on the database at run time.

2)Sparse lookups can be used if the input data is smaller than the reference data.
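As an illustration of the SQL a sparse lookup fires (table and column names are made up; the ORCHESTRATE.column notation is how the key value from the primary link is usually referenced in the user-defined lookup SQL, but verify this against your connector's documentation):

SELECT EMP_NAME, DEPT_NO
FROM   EMP
WHERE  EMP_ID = ORCHESTRATE.EMP_ID

This statement is executed against the database once for every row arriving on the primary link.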

Tuesday, February 5, 2013

Difference between Hash and sort grouping methods in Aggregator stage


Grouping Methods
Hash (default)
1)Calculations are made for all groups and stored in memory
2)Results are written out only after all input has been processed, so a large amount of memory is required when the volume of input is high
3)Input does not need to be sorted
4)Useful when the number of unique groups is small 
Sort
1)Requires the input data to be sorted by grouping keys
2)Only a single aggregation group is kept in memory so less memory is required
3)When a new group is seen, the current group is written out
4)Can handle unlimited numbers of groups

Conclusion - When the volume of input is high or not predictable, it is better to use the Sort method.

Thursday, December 13, 2012

What is External Filter Stage?

1)The External Filter stage is a processing stage.
2)In DataStage, the user can filter data using UNIX commands (sed, cut, cat, grep, head, etc.) with the help of the External Filter stage.
The External Filter stage allows us to run these commands while the job processes the data.
Job:
Now I want to use the grep command to filter the input file (employee file - emp.txt) for records with DEPTNO=10.

Command : grep  "10$"
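Since DEPTNO is the last column of each record, grep "10$" keeps only the lines that end with 10. With hypothetical input lines (the delimiter and values are made up) it behaves like this:

7369,SMITH,20     -> dropped
7499,ALLEN,30     -> dropped
7782,CLARK,10     -> kept
7839,KING,10      -> kept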

Click the links below for the input file (employee file - emp.txt) and the dsx file for the job - J_ExternalFilter

The Source File - emp.txt and the dsx file for the Job - J_ExternalFilter


Tuesday, December 4, 2012

How to abort the Job based on Condition?

I/p

col 
100 
200 
300 
100 
100 

If any record with a value other than 100 comes in the col column, I need to abort the job.

We can use a Transformer to abort the job based on this condition.

Stage variable derivation: StgVar = If Inlink.col <> "100" Then 'Y' Else 'N'

Constraint:

StgVar = 'Y'

Abort After Rows = 1