Thursday, December 23, 2010

E T L - ENGINE

DEFINITION OF ETL

ETL stands for Extract, Transform, and Load. That is, ETL programs periodically extract data from source systems, transform the data into a consistent format, and then load the data into the target data store. During ETL process the data is extracted from an OLTP database or non-OLTP system and transformed to match the data warehouse schema and finally loaded into the data warehouse database. In its simplest form, ETL is copying a data from one database to another.

When defining an ETL for data warehouse, it is important to think of ETL as a process, not a physical implementation. The way of implementing ETL can vary from data warehouse to data warehouse and even between department data marts within a data warehouse. An ETL system consists of four distinct functional elements:

• Extraction
• Transformation
• Loading
• Meta data

Extraction:
This process extract the data from source systems (source system can be OLTP, legacy, mainframe, flat files, etc) data using adapters, such as ODBC, native SQL formats, or flat file extractors. These adapters consult metadata to determine which data to extract and how.

Transformation:
This process transforms the extracted data into a consistent data warehouse schema by applying the business rules on it. This process is also responsible for data validation, data accuracy, data-type conversion and business rules application. It is most complicated of the ETL elements.

Load:
The ETL Load element is responsible for loading the transformed data into data warehouse using target data adapters such as SQL loader, Bulk process, BCP, etc.

Metadata:
The ETL metadata element is responsible for maintaining the information about the data. The metadata repository makes metadata available to the ETL engine at run time.

HAND-CODED ETL PROCESS:

When ETL tools were not in the picture, developers were using custom codes to perform the ETL operations. The programs written using this method were lengthy and hard to document. The ETL developer has to use different programming languages to perform the ETL task, such as; Perl scripts for extracting the data from source systems, performing transformations, and SQL Loader and PL/SQL Bulk procedures were used to load the data in target warehouse. Also some wrapper scripts like shell scripts were also being used to make these ETL as packaged programs. Writing these custom programs does not seem to be feasible option because there are some serious problems with this method of ETL process.

Strengths & Limitations:
• Strengths:
O Though you have to create the metadata manually in hand-coded ETL systems but still you can more directly manage it by yourself.
O Testing of the ETL code written is easy since many automated unit testing tools are available for hand-coded systems.
O Hand-coded ETL are more flexible, you can do whatever you want without any limitations.

• Limitations:

O While designing a DW application, a provision has to be made to accommodate the changes that should not affect the existing design. To keep up with high volume of changes initiated by end users, hand-written ETL programs have to be continually modified and in many cases rewritten. The effort required to maintain these programs frequently becomes a major burden for the project.
O Metadata is the backbone of any DW project. Hand-coding of ETL requires maintaining of metadata tables separately. Any new changes require changes to metadata tables that have to be done manually.
O Hand-coded ETL programs are likely to have a slower speed of execution since hand-generated programs are typically single-threaded, while modern ETL tools generate multi-threaded, directly executable code that can run on parallel, high-speed engines.

TOOL-BASED ETL PROCESS:

To avoid the overhead caused due to hand-coded ETL process, many vendors have developed ETL tools to perform extraction, transformation and loading process. An important function of these tools is to generate and maintain centralized metadata. As computer systems started to evolve from monolithic mainframes to distributed computing systems, and as business intelligence made its debut, the first ETL solutions were introduced. Initially these tools provided the ability to extract the data from mainframes and load into target database. But the ETL tools have matured enough now-a-days providing user-friendly GUI’s, additional functionalities and performance benefits. The ETL tools may be code-generators or engine-based tools. These graphical tools provide ease of development, by either generating the code or enabling the ETL process through its internal engine. They reduce development and maintenance time and cost. There is still a debate about whether ETL engines or code generators, which have improved since their debut in the mid-1990s, offer the best functionality and performance. Since that time, several generations of ETL have been produced.

Strengths & Limitations:
Strengths:
    • The amount of ETL development time will be drastically reduced.
    • Tool will come with wide range of rich built-in features.
    • The tool can be reused for future phases of the project, or future projects.
    • The ETL rules will be more readable for easy of debugging.
    • Simplicity of ETL process maintenance.

Limitations:
  • Increase time spend for evaluation, selection and learning of the product.
  • The higher cost of purchasing, training and licensing fees might be out of budget.
  • Because is not custom build, might not fit 100% to your needs.
EXAMPLE:1
My Source : dept.txt
My Target : dept table

PROCESS : What ETL Engine does mean it push data’s in the file dept.txt in to database table dept.

For this we are maintaining XL Sheet, it contains information about Source file name, Target table name, Column names, Data types and Rules. Based on these things I am pushing data from dept file into dept table.

In source file the number of columns must be less then or equal to number of columns in the target. Before run ETL Engine we want to create a table based on the XL Sheet.

XL SHEET
File name
table name
Column name
data type
validation rule
default value
c:\dept.txt
dept
deptno
number

if the value is other than 0, insert the same,
if the value is 0 then use the dept_seq.nextval
c:\dept.txt
dept
dname
varchar
length should be atleast 10 characters
unknown
c:\dept.txt
dept
loc
varchar
length should be atleaset 6 characters


SOURCE DATA (FLAT FILE)

Deptno ,dname , loc
10,DEV,CHENNAI
20,MARKETING,BANGALORE
30,PROD SUP,BANGALORE

BEFORE RUN ETL ENGINE:

TARGET DATABASE (ORACLE)

DEPTNO DNAME LOC

AFTER RUN ETL ENGINE:

TARGET DATABASE (ORACLE)

DEPTNO DNAME LOC
10 DEV CHENNAI
20 MARKETING BANGALORE
30 PROD SUP BANGALORE

EXAMPLE:2
My Source : emp.txt
My Target : emp table
PROCESS : What ETL Engine does mean it push data’s in the file emp.txt in to database table dept.
For this we are maintaining XL Sheet, it contains information about Source file name, Target table name, Column names, Data types and Rules. Based on these things I am pushing data from dept file into dept table.
In source file the number of columns must be less then or equal to number of columns in the target. Before run ETL Engine we want to create a table based on the XL Sheet.

XL SHEET
File name
table name
Column name
data type
validation rule
default value
c:\ emp.txt
emp
empno
number

if the value is other than 0, insert the same,
if the value is 0 then use the dept_seq.nextval
c:\ emp.txt
emp
ename
varchar
length should be atleast 10 characters
unknown
c:\ emp.txt
emp
job
varchar
length should be atleaset 6 characters

c:\ emp.txt
emp
mgr
number
length should be atleaset 4 characters

c:\ emp.txt
emp
hiredate
date


c:\ emp.txt
emp
sal
number
length should be atleaset 4 characters

c:\ emp.txt
emp
comm
number
length should be atleaset 4 characters

c:\ emp.txt
emp
deptno
number
length should be atleaset 2 characters


SOURCE DATA (FLAT FILE)

EMPNO ENAME JOB MGR HIREDATE SAL COMM DEPTNO
---------- ---------- --------- ---------- --------- ---------- ---------- -----------------------------
7369 SMITH CLERK 7902 17-DEC-80 800 20
7499 ALLEN SALESMAN 7698 20-FEB-81 1600 300 30
7521 WARD SALESMAN 7698 22-FEB-81 1250 500 30
7566 JONES MANAGER 7839 02-APR-81 2975 20
7654 MARTIN SALESMAN 7698 28-SEP-81 1250 1400 30
7698 BLAKE MANAGER 7839 01-MAY-81 2850 30
7782 CLARK MANAGER 7839 09-JUN-81 2450 10
7788 SCOTT ANALYST 7566 19-APR-87 3000 20
7839 KING PRESIDENT 17-NOV-81 5000 10
7844 TURNER SALESMAN 7698 08-SEP-81 1500 0 30
7876 ADAMS CLERK 7788 23-MAY-87 1100 20
7900 JAMES CLERK 7698 03-DEC-81 950 30
7902 FORD ANALYST 7566 03-DEC-81 3000 20
7934 MILLER CLERK 7782 23-JAN-82 1300 10



BEFORE RUN ETL ENGINE:

TARGET DATABASE (ORACLE)

EMPNO ENAME JOB MGR HIREDATE SAL COMM DEPTNO

AFTER RUN ETL ENGINE:

TARGET DATABASE (ORACLE)

EMPNO ENAME JOB MGR HIREDATE SAL COMM DEPTNO
---------- ---------- --------- ---------- --------- ---------- ---------- -----------------------------
7369 SMITH CLERK 7902 17-DEC-80 800 20
7499 ALLEN SALESMAN 7698 20-FEB-81 1600 300 30
7521 WARD SALESMAN 7698 22-FEB-81 1250 500 30
7566 JONES MANAGER 7839 02-APR-81 2975 20
7654 MARTIN SALESMAN 7698 28-SEP-81 1250 1400 30
7698 BLAKE MANAGER 7839 01-MAY-81 2850 30
7782 CLARK MANAGER 7839 09-JUN-81 2450 10
7788 SCOTT ANALYST 7566 19-APR-87 3000 20
7839 KING PRESIDENT 17-NOV-81 5000 10
7844 TURNER SALESMAN 7698 08-SEP-81 1500 0 30
7876 ADAMS CLERK 7788 23-MAY-87 1100 20
7900 JAMES CLERK 7698 03-DEC-81 950 30
7902 FORD ANALYST 7566 03-DEC-81 3000 20
7934 MILLER CLERK 7782 23-JAN-82 1300 10

CODING

create or replace procedure proc1(file_name varchar2) is
v_sql varchar2(54);
v_fileid number(5);
v_count number(5);
x number(5):=1;
y number(5):=1;
z number(5):=1;
type t_id is table of varchar2(15) index by binary_integer;
v_tableid t_id;
v_tablename t_id;
v_columnid t_id;
v_columnname t_id;
v_columndt t_id;
v_columnlen t_id;
F1 UTL_FILE.FILE_TYPE;
v varchar2(100);
hed varchar2(100);
v_columnhed varchar2(100);
l_in number(5):=0;
a number(5);
b number(5);
c number(5);
l_column number(5);
v_columnrec varchar2(53);
ma varchar2(120);
quo varchar2(4):=chr(39);
dt varchar2(153);
v_crt varchar2(53);
v_ins varchar2(153);
begin
select fileid into v_fileid from file1 where filename=file_name;
dbms_output.put_line('file id'||v_fileid);
for i in (select tableid,tablename from table1 where fileid=v_fileid ) loop
v_tableid(x):=i.tableid;
v_tablename(x):=i.tablename;
dbms_output.put_line('table id '||v_tableid(x));
dbms_output.put_line('table name '||v_tablename(x));
x:=x+1;
end loop;
for i in (select columnid,columnname,columndatatype,columnlength from column1
where tableid=v_tableid(y)) loop
v_columnid(y):=i.columnid;
v_columnname(y):=i.columnname;
v_columndt(y):=i.columndatatype;
v_columnlen(y):=i.columnlength;
dbms_output.put_line('column id '||v_columnid(y));
dbms_output.put_line('column name '||v_columnname(y));
dbms_output.put_line('column datatype '||v_columndt(y));
dbms_output.put_line('column length '||v_columnlen(y));
y:=y+1;
end loop;
for i in 1..x-1 loop
for j in 1..y-1 loop
if v_columndt(j)='N' then
dt:=' NUMBER';
elsif v_columndt(j)='D' then
dt:=' DATE';
else
dt:=' varchar2';
end if;
if j=1 then
v_crt:='create table '||v_tablename(i)||'('||v_columnname(j)||
' '||dt||'('||v_columnlen(j)||'))';
-- execute immediate v_crt;
dbms_output.put_line(v_crt);
else
v_crt:='alter table '||v_tablename(i)||' add '|| v_columnname(j)||' '||
dt||'('||v_columnlen(j)||'))';
-- execute immediate v_crt;
dbms_output.put_line(v_crt);
end if;
end loop;
end loop;
F1:=UTL_FILE.FOPEN('MANI',file_name,'R');
UTL_FILE.GET_LINE(F1,V);
l_column:=length(v)-length(replace(v,',',''));
v:=v||',';
a:=1;
b:=instr(v,',',1,1)-1;
for i in 1..l_column+1 loop
v_columnhed:=substr(v,a,b);
a:=b+a+1;
c:=i+1;
b:=instr(v,',',1,i+1)-a;
hed:=hed||v_columnhed||',';
end loop;
hed:=substr(hed,1,length(hed)-1);
loop
UTL_FILE.GET_LINE(F1,V);
begin
IF V IS NOT NULL THEN
y:=1;
ma:=null;
v:=v||',';
a:=1;
b:=instr(v,',',1,1)-1;
for i in 1..l_column+1 loop
v_columnrec:=substr(v,a,b);
a:=b+a+1;
c:=i+1;
b:=instr(v,',',1,i+1)-a;
if v_columndt(y)='N' then
ma:=ma||v_columnrec||',';
else
ma:=ma||''''||v_columnrec||''''||',';
end if;
y:=y+1;
end loop;
ma:=substr(ma,1,length(ma)-1);
if z=x-1 then
v_ins:='insert into '||v_tablename(1)||' ('||hed||')values('||ma||')';
-- execute immediate v_ins;
dbms_output.put_line(v_ins);
end if;
END IF;
EXCEPTION
WHEN NO_DATA_FOUND THEN
null;
EXIT;
end;
END LOOP;
EXCEPTION
WHEN NO_DATA_FOUND THEN
null;
commit;
end;
/






No comments:

Post a Comment