19/06/2013

[PL/SQL] Read, parse and store data from an XML file

Suppose you have a fairly complex XML file with this structure:

 <myRoot>  
      <myElement>  
           <aNode>  
                <aData>123</aData>  
                <bData><![CDATA[some lengthy text with special <characters>in</it>]]></bData>  
           </aNode>   
           <otherNode>  
                <otherData>31/12/1900</otherData>  
                .  
                .  
                .  
                <otherData>01/01/2000</otherData>  
           </otherNode>  
           <lastNode>  
                <simpleTag>  
                     <simpleValue1>some text</simpleValue1>  
                     <simpleValue2>other text</simpleValue2>  
                     <simpleFilter>0</simpleFilter>  
                </simpleTag>  
                <simpleTag>  
                     <simpleValue1>blah blah</simpleValue1>  
                     <simpleValue2>other blah</simpleValue2>  
                     <simpleFilter>1</simpleFilter>  
                </simpleTag>  
                <simpleTag>  
                     <simpleValue1>foo</simpleValue1>  
                     <simpleValue2>bar</simpleValue2>  
                     <simpleFilter>1</simpleFilter>  
                </simpleTag>  
                .  
                .  
                .  
           </lastNode>  
      </myElement>  
      .  
      .  
      .  
 </myRoot>  


Also, we won't have a proper schema file but we know that it'll have one or more myElement nodes, for each of which it'll have:
  • a single aNode node with an integer aData element and a CDATA bData text element
  • a single otherNode node with multiple otherData elements representing dates
  • a single lastNode node with multiple simpleTag child nodes, each composed of three elements: a string in both simpleValue1 and simpleValue2 and an integer in simpleFilter. Additionally, we also know that for every lastNode node, just one and only one simpleTag node will have its simpleFilter element set to 0.

Now you want to store it in a table, then read it, parse its contents and save them in another table.

The table where you'll save it is defined as:

CREATE TABLE myTable(
    myId          INTEGER;
    myXML     CLOB;
    filename    VARCHAR2(200);
    ts_ins        DATE;--insert timestamp
    CONSTRAINT myTable_pk PRIMARY KEY (myId)
);


And of course, you'll have a SEQUENCE for its primary key:

CREATE OR REPLACE SEQUENCE myTable_seq
MINVALUE 1
MAXVALUE 999999--it can go much higher
START WITH 1
INCREMENT BY 1
CACHE 10;--unless you expect multiple concurrent accesses, it's pretty useless


The XML file will be read from a DIRECTORY:

CREATE OR REPLACE DIRECTORY WORK_DIR AS 'C:\temp';--or whatever

Note that the directory MUST be on the same machine where the Oracle instance is running, to access remote directories, you must first map it as a network directory on the machine running the instance and then treat it as a local folder.

We'll also have a simple log table:

CREATE TABLE myLog(
    errmsg      VARCHAR2(2000);
    ts          DATE;
);


Into which we'll write using a procedure. To ensure that our log procedure only commits its data, we'll define it as PRAGMA AUTONOMOUS TRANSACTION, which encloses said procedure in a separate transaction so that it cannot commit on the main transaction and is not rolled back from it either.

Finally, the parsed data will be stored in yet another table. Furthermore, since the XML node otherNode has one or more children, we'll store it as a NESTED TABLE; to do so, we'll have to define a new datatype:

CREATE OR REPLACE TYPE t_myDates AS TABLE OF DATE;--of course we could use more complex data structures here

then the table:

CREATE TABLE myData(
    myId        INTEGER;
    aData       INTEGER;
    bData        VARCHAR2(200);
    otherData        t_myDates;

    simpleValue   VARCHAR2(200);
    CONSTRAINT myData_pk PRIMARY KEY (myId)
)

NESTED TABLE otherData STORE AS myData_otherData_NT;

and its sequence:

CREATE OR REPLACE SEQUENCE myData_seq
MINVALUE 1
MAXVALUE 999999
START WITH 1
INCREMENT BY 1
CACHE 10;


Now we'll define all our functions/procedures in a PACKAGE myPackage with the following package specification:

 CREATE OR REPLACE PACKAGE myPackage  
 IS  
   
 FUNCTION readXML (FileNameIn IN VARCHAR2)  
      RETURN PLS_INTEGER;  
        
 PROCEDURE writeLog(MessageIN IN varchar2);  
   
 FUNCTION parseData (IdIn IN INTEGER)  
      RETURN PLS_INTEGER;  
        
 END;  


All methods defined inside the package specification are publicly accessible from outside the package, moreover, by defining the function parseData inside the package specification we are able to call it from within the body of readXML without having to resort to the forward declaration or defining the parseData method BEFORE readXML. In any other case, the compiler would complain that it does not know of a "parseData" object.

and package body:

 CREATE OR REPLACE PACKAGE BODY myPackage  
 IS  
   
      STD_DEF_DATE varchar2(10):='dd/mm/yyyy';  
   
 FUNCTION readXML (FileNameIn IN VARCHAR2)  
      RETURN PLS_INTEGER  
 IS  
      dest_clob  CLOB;  
      src_clob  BFILE := BFILENAME('WORK_DIR', FileNameIn); --from the LOCAL (Oracle instance) filesystem  
      dst_offset number := 1 ;  
      src_offset number := 1 ;  
      lang_ctx  number := DBMS_LOB.DEFAULT_LANG_CTX;  
      warning   number;  
      l_id                myTable.myId%type;  
      l_parseDataRes      PLS_INTEGER;  
   
 BEGIN  
   
      SAVEPOINT svp_ini;  
   
      --get a new id from our sequence  
      SELECT myTable_seq.NEXTVAL INTO l_id FROM DUAL;  
   
      --put some data in the table along an empty CLOB which we'll fill later  
      INSERT INTO myTable(myId,  
           myXML,  
           filename,  
           ts_ins)  
      VALUES(l_id,   
           empty_clob(),   
           FileNameIn,   
           SYSDATE)  
      RETURNING myXML INTO dest_clob;  
        
      --now open our file in read mode   
      DBMS_LOB.OPEN(src_clob, DBMS_LOB.LOB_READONLY);  
        
      --and copy its contents in our CLOB  
      DBMS_LOB.LOADCLOBFROMFILE(dest_lob => dest_clob,  
           src_bfile => src_clob,  
           amount => DBMS_LOB.GETLENGTH(src_clob),  
           DEST_OFFSET => dst_offset ,  
           SRC_OFFSET  => src_offset,  
           BFILE_CSID  => DBMS_LOB.DEFAULT_CSID,  
           LANG_CONTEXT => lang_ctx,  
           WARNING   => warning);  
        
      --remember to close the file!  
      DBMS_LOB.CLOSE(src_clob);  
   
      COMMIT;  
        
      l_parseDataRes:=parseData(l_id);  
   
      RETURN 1;  
        
 EXCEPTION WHEN OTHERS THEN  
      IF DBMS_LOB.ISOPEN(src_clob) = 1 THEN--if the file is still open close it!  
           DBMS_LOB.CLOSE(src_clob);  
      END IF;  
        
      --log the event  
      writelog('Error: '||SQLERRM||' while reading '||FileNameIn);  
      ROLLBACK TO svp_ini;  
      RETURN 0;  
 END readXML;  
   
 PROCEDURE writeLog(MessageIN IN varchar2) IS PRAGMA AUTONOMOUS_TRANSACTION; --to prevent unwanted commits!!!  
 BEGIN  
   INSERT INTO myLog VALUES(MessageIn, SYSDATE);  
   COMMIT;  
 END writeLog;  
   
 FUNCTION parseData (IdIn IN INTEGER)  
      RETURN PLS_INTEGER  
 IS  
      l_XML                    XMLTYPE;  
      l_XML_content     CLOB;  
      l_aData                    myData.aData%type;  
      l_bData                    myData.bData%type;  
      l_otherData          myData.otherData%type;  
      l_simpleValue     myData.simpleValue%type;  
      l_myDataId          PLS_INTEGER;  
      l_countSimpleFilter     PLS_INTEGER;  
 BEGIN  
   
 --get the XML  
 SELECT myXML  
 INTO l_XML_content  
 FROM myTable  
 WHERE myId = IdIn;  
   
   
 l_XML := XMLType.CreateXML (l_XML_content);  
   
 --loop on our XML  
   
 FOR myCur IN (  
      SELECT myElement."aData", myElement."bData", myElement."otherNode", myElement."lastNode" --note the "" around the column names!  
      FROM XMLTABLE('//myElement' --creates a temporary table from a given XML with the structure specified  
           PASSING l_XML  
           COLUMNS "aData" INTEGER PATH '//aNode/aData', /*path defines the full (starting with a single /) or relative   
           (starting with a double //) path where the data is located inside the XML. It would be equivalent to '/myRoot/myElement/aNode/aData' */  
                "bData"     VARCHAR2(200) PATH '//aNode/bData/text()', --we use text() since it's a CDATA value  
                "otherNode" XMLTYPE PATH '//otherNode', --we'll cycle on that later  
                "lastNode" XMLTYPE PATH '//lastNode' --we'll cycle on that later  
           ) myElement  
 )LOOP --we are now looping on each myElement element inside our XML  
      BEGIN  
   
           SAVEPOINT svp_parsedata;  
   
           l_aData:=myCur."aData";  
           l_bData:=myCur."bData";  
   
           --we now want to read and concatenate the two simpleValue elements from the simpleTag node where simpleFilter is set to 0  
             
           --suppose you don't trust that there's just one simpleFilter with value 0, you can easily check it  
           SELECT COUNT(*)  
           INTO l_countSimpleFilter  
           FROM TABLE (XMLSEQUENCE(EXTRACT(l_XML, '//lastNode/simpleTag'))) t /*using XMLSEQUENCE we extract one by one the simpleTag  
           values and create a collection from them, then we put them in a temporary table with alias t and work on it*/  
           WHERE EXTRACTVALUE(VALUE(t),'//simpleTag/simpleFilter')=0;  
             
           IF l_countSimpleFilter > 1 THEN --if there are more, raise a user defined error to stop parsing the current myElement node  
                RAISE_APPLICATION_ERROR(-20000, 'Found more than one simpleTag with simpleFilter = 0');   
           END IF;  
             
             
           --now we're 100% sure there's only one  
           SELECT EXTRACTVALUE(XMLQUERY( /*remember that we know there will be ONLY ONE   
                                                                            node with element simpleFilter=0, that's why we can use EXTRACTVALUE*/  
                'for $i in $myXML/lastNode/simpleTag  
                where $i/simpleFilter = 0  
                return <simpleValue>{$i/simpleValue1/text()}&#160;{$i/simpleValue2/text()}</simpleValue>'  
                PASSING myCur."lastNode" AS "myXML" RETURNING CONTENT), '//simpleValue/text()', null) /*the use of &#160; is mandatory   
                                                                                                                                                                 if you want a space between the two values*/  
           INTO l_simpleValue /*using XMLQUERY we've queried lastNode for the values we were interested into,   
           knowing that only a single child node would have matched our criteria. Then we had it return a single XML node   
           tagged simpleValue from which we extracted its contents with EXTRACTVALUE*/  
           FROM DUAL;  
   
           --now we're going to loop on the otherNode node to get all its otherData elements  
   
           /*if you're using a more complex nested table object, don't forget to initialize it with the proper constructor!  
           l_otherData:=myComplexObject(); */  
           FOR myData IN (   
                SELECT otherNode."otherData"  
                FROM XMLTABLE('//otherNode'  
                     PASSING myCur."otherNode"  
                     COLUMNS "otherData" VARCHAR2(10) PATH '//otherData'  
                     ) otherNode  
           )LOOP  
                l_otherData.EXTEND;  
                l_otherData(l_otherData.LAST):=TO_DATE(myData."otherData", STD_DEF_DATE); /*convert to date from varchar using the date format specified  
                                                                                                                                                  at the beginning of our package*/  
                /*if your myComplexObject is TABLE OF mySimpleObject, then you would  
                l_otherData(l_otherData.LAST):=mySimpleObject(value1,...,valueN)*/  
           END LOOP;  
             
           --finally store out data inside myData table  
   
           SELECT myData_seq.NEXTVAL INTO l_myDataId FROM DUAL;  
             
           INSERT INTO myData(myId,  
                aData,  
                bData,  
                otherData,  
                simpleValue  
           )  
           VALUES(  
                l_aData,  
                l_bData,  
                l_otherData,  
                l_simpleValue  
           );  
             
           COMMIT;  
        
      EXCEPTION WHEN OTHERS THEN  
           writeLog('Error: '||SQLERRM||' while parsing data');  
           ROLLBACK TO svp_parsedata;  
      END;    
 END LOOP;  
   
 RETURN 1;  
   
 EXCEPTION WHEN OTHERS THEN  
      writeLog('Error: '||SQLERRM||' while reading from XML');  
      RETURN 0;  
 END parseData;  
   
 END myPackage;  


We used many complex objects or functions to create our package, here's what to note:
  • the source XML is an external binary file, a BFILE
  • there's a SYS package that allows us to easily work on LOB objects, DBMS_LOB
  • when manipulating the XML inside our script, we treat it as an XMLTYPE which can be queried with the EXTRACT and EXTRACTVALUE functions. If the XML file has a complex structure, we'll need to use the XMLTABLE and XMLSEQUENCE functions to manipulate it better. Last but not least, don't forget that we can always resort to the XMLQUERY function to query XML data; the downside is that the syntax is slightly different and more complicated than standard SQL
  • should we need to define new exceptions, we can use the RAISE_APPLICATION_ERROR function

1 comment:

  1. nice piece of information, I had come to know about your internet site from my friend vinay, delhi,i have read atleast 12 posts of yours by now, and let me tell you, your website gives the best and the most interesting information. This is just the kind of information that i had been looking for, i'm already your rss reader now and i would regularly watch out for the new post, once again hats off to you! Thanx a ton once again, Regards, sql and plsql difference

    ReplyDelete

With great power comes great responsibility.

Da grandi poteri derivano grandi responsabilità.