Author Topic: XML Parsing  (Read 8123 times)

JRS

  • Guest
XML Parsing
« on: November 30, 2010, 09:56:26 PM »
I have to interface with a real estate web service and have been looking at ways to extract the XML data and prepare it for INSERT/UPDATE into a SQL database using ScriptBasic. This snippet returns only the tags that contain data, skipping the header, footer, and empty fields.

Anyone up for a code challenge to try and create a smaller or more efficient XML parser?

xmlparse.sb
Code: [Select]
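IMPORT t.bas

' Load sample.xml into a single string
xml = t::LoadString("sample.xml")

' Split on the closing bracket so each chunk is either a tag or value-plus-close-tag
SPLITA xml BY ">" TO a

FOR x = 0 TO UBOUND(a)
  ' Chunks that start with a tag bracket are opening, container, or empty tags: no data
  IF LEFT(TRIM(a[x]),1) = "<" THEN GOTO IT
  ' The rest holds the value, then the close-tag marker, then the tag name
  p = INSTR(a[x],"<")
  PRINT MID(a[x],p+2) & " = " & LEFT(a[x],p-1),"\n"
IT:
NEXT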
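
For anyone who wants to try the same split-on-">" idea outside ScriptBasic, here is a rough Python sketch of the technique. It is not the code above, just a stand-in under the same assumptions: no attributes on data tags, no CDATA, and no ">" inside field text. The function name `extract_fields` and the shortened sample are made up for illustration.

```python
def extract_fields(xml_text):
    """Return (name, value) pairs for every element that has text content."""
    pairs = []
    for chunk in xml_text.split(">"):
        stripped = chunk.strip()
        # Chunks that begin with "<" are opening tags, container close tags,
        # or self-closing empty tags; none of them carry data.
        if not stripped or stripped.startswith("<"):
            continue
        # What remains looks like "62981</LN": value, then "</", then name.
        value, sep, name = chunk.partition("</")
        if sep:
            pairs.append((name.strip(), value.strip()))
    return pairs


# Shortened, hypothetical sample in the shape of the listing below.
sample = """<Listings>
  <Residential>
    <LN>62981</LN>
    <DRP/>
    <CIT>Seattle</CIT>
  </Residential>
</Listings>"""

for name, value in extract_fields(sample):
    print(name, "=", value)
```

Like the original, this is a filter rather than a parser: it keeps no record boundaries and will misread anything that is not a simple one-tag-per-field document.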


Example XML residential listing
Code: XML
<Listings xmlns="http://www.nwmls.com/Schemas/Standard/StandardXML1_1.xsd">
  <Residential>
    <LN>62981</LN>
    <PTYP>RESI</PTYP>
    <LAG>27022</LAG>
    <ST>CT</ST>
    <LP>599950.00</LP>
    <SP>0.00</SP>
    <OLP>599950.00</OLP>
    <HSN>4538</HSN>
    <DRP/>
    <STR>23rd</STR>
    <SSUF>Ave</SSUF>
    <DRS>SW</DRS>
    <UNT/>
    <CIT>Seattle</CIT>
    <STA>WA</STA>
    <ZIP>98126</ZIP>
    <PL4/>
    <BR>5.00</BR>
    <BTH>3.50</BTH>
    <ASF>3650</ASF>
    <LSF>4800</LSF>
    <UD>2010-04-24 14:59:25</UD>
    <AR>140</AR>
    <DSRNUM>7215</DSRNUM>
    <LDR>2010-04-24 00:00:00</LDR>
    <LD>2010-04-24 00:00:00</LD>
    <CLO>1800-01-01 00:00:00</CLO>
    <YBT>2010</YBT>
    <LO>1401</LO>
    <TAX>1773600264</TAX>
    <MAP>594</MAP>
    <GRDX>G</GRDX>
    <GRDY>4</GRDY>
    <SAG>0</SAG>
    <SO>0</SO>
    <NIA>Y</NIA>
    <MR>Third of three New Contemporary Homes w/fantastic open floor plans and great level and fenced backyards.These homes have wonderful tall ceilings,designer paint,fully wrapped windows,solid core/glass int doors and top of the line strand Bamboo flrs.The kitchen is an entertainers dream w/an enormous open eating bar,honed granite counters,custom wood cabinets,top of the line stainless steel appls and french doors to the ent backyard. Quality and Designer features from top to Bottom, a must see!!</MR>
    <LONG>-122.362210</LONG>
    <LAT>47.561975</LAT>
    <PDR>1800-01-01 00:00:00</PDR>
    <CLA>0</CLA>
    <SHOADR>Y</SHOADR>
    <DD>From Delridge Way head east on Oregon which becomes 23rd.</DD>
    <AVDT>1800-01-01 00:00:00</AVDT>
    <INDT>1800-01-01 00:00:00</INDT>
    <COU>King</COU>
    <CDOM>0</CDOM>
    <CTDT>2010-04-24 00:00:00</CTDT>
    <SCA>0</SCA>
    <SCO>0</SCO>
    <VIRT/>
    <SD>SEA</SD>
    <SDT>2010-04-24 00:00:00</SDT>
    <FIN/>
    <MAPBOOK>THOM</MAPBOOK>
    <DSR>Pigeon Point</DSR>
    <QBT>0</QBT>
    <LSZS/>
    <HSNA/>
    <COLO>0</COLO>
    <PIC>1</PIC>
    <ADU/>
    <ARC>K</ARC>
    <BDC/>
    <BDL>2</BDL>
    <BDM>0</BDM>
    <BDU>3</BDU>
    <BLD>JDR Development Inc</BLD>
    <BLK>14</BLK>
    <BRM/>
    <BUS>Y</BUS>
    <DNO>L</DNO>
    <DRM>M</DRM>
    <EFR/>
    <EL/>
    <ENT>M</ENT>
    <F17>A</F17>
    <FAM>M</FAM>
    <FBG>0</FBG>
    <FBL>1</FBL>
    <FBM>0</FBM>
    <FBT>3</FBT>
    <FBU>2</FBU>
    <FP>1</FP>
    <FPL>0</FPL>
    <FPM>1</FPM>
    <FPU>0</FPU>
    <GAR>2</GAR>
    <HBG>0</HBG>
    <HBL>0</HBL>
    <HBM>1</HBM>
    <HBT>1</HBT>
    <HBU>0</HBU>
    <HOD>0</HOD>
    <JH/>
    <KES>M</KES>
    <KIT/>
    <LRM>M</LRM>
    <LSD/>
    <LSZ/>
    <LT>16</LT>
    <MBD>U</MBD>
    <MHM/>
    <MHN/>
    <MHS/>
    <MOR>0</MOR>
    <NC>U</NC>
    <POC>SEA</POC>
    <POL/>
    <PRJ>Cottage Grove # 3</PRJ>
    <PTO>Y</PTO>
    <TQBT>0</TQBT>
    <RRM>L</RRM>
    <SAP>0</SAP>
    <SFF>0</SFF>
    <SFS>Per Builder Plans</SFS>
    <SFU>0</SFU>
    <SH/>
    <SML>Y</SML>
    <SNR>N</SNR>
    <STY>18</STY>
    <SWC>SEA</SWC>
    <TBG>0</TBG>
    <TBL>0</TBL>
    <TBM>0</TBM>
    <TBU>0</TBU>
    <TX>0</TX>
    <TXY>0</TXY>
    <UTR>U</UTR>
    <WAC>SEA</WAC>
    <WFG/>
    <WHT/>
    <APS>A|D|E|F|G</APS>
    <BDI>E</BDI>
    <BSM>A|B</BSM>
    <ENS>B</ENS>
    <EXT>J|E</EXT>
    <FEA>A|D|F|G|J|M|P|T</FEA>
    <FLS>J|A|G</FLS>
    <FND>E|F</FND>
    <GR>C</GR>
    <HTC>B</HTC>
    <LDE>H|J</LDE>
    <LTV>E|F</LTV>
    <POS>A</POS>
    <RF>C</RF>
    <SIT>G|H|M|Y|N</SIT>
    <SWR>A</SWR>
    <TRM>B|C</TRM>
    <VEW>D|L</VEW>
    <WAS>D</WAS>
    <WFT/>
    <BUSR/>
    <CMFE/>
    <ECRT/>
    <ZJD>A</ZJD>
    <ZNC>SF 5000</ZNC>
    <ProhibitBLOG>Y</ProhibitBLOG>
    <AllowAVM>Y</AllowAVM>
    <PARQ>N</PARQ>
    <BREO>N</BREO>
  </Residential>
</Listings>

Results
Code: [Select]
LN = 62981
PTYP = RESI
LAG = 27022
ST = CT
LP = 599950.00
SP = 0.00
OLP = 599950.00
HSN = 4538
STR = 23rd
SSUF = Ave
DRS = SW
CIT = Seattle
STA = WA
ZIP = 98126
BR = 5.00
BTH = 3.50
ASF = 3650
LSF = 4800
UD = 2010-04-24 14:59:25
AR = 140
DSRNUM = 7215
LDR = 2010-04-24 00:00:00
LD = 2010-04-24 00:00:00
CLO = 1800-01-01 00:00:00
YBT = 2010
LO = 1401
TAX = 1773600264
MAP = 594
GRDX = G
GRDY = 4
SAG = 0
SO = 0
NIA = Y
MR = Third of three New Contemporary Homes w/fantastic open floor plans and great level and fenced backyards.These homes have wonderful tall ceilings,designer paint,fully wrapped windows,solid core/glass int doors and top of the line strand Bamboo flrs.The kitchen is an entertainers dream w/an enormous open eating bar,honed granite counters,custom wood cabinets,top of the line stainless steel appls and french doors to the ent backyard. Quality and Designer features from top to Bottom, a must see!!
LONG = -122.362210
LAT = 47.561975
PDR = 1800-01-01 00:00:00
CLA = 0
SHOADR = Y
DD = From Delridge Way head east on Oregon which becomes 23rd.
AVDT = 1800-01-01 00:00:00
INDT = 1800-01-01 00:00:00
COU = King
CDOM = 0
CTDT = 2010-04-24 00:00:00
SCA = 0
SCO = 0
SD = SEA
SDT = 2010-04-24 00:00:00
MAPBOOK = THOM
DSR = Pigeon Point
QBT = 0
COLO = 0
PIC = 1
ARC = K
BDL = 2
BDM = 0
BDU = 3
BLD = JDR Development Inc
BLK = 14
BUS = Y
DNO = L
DRM = M
ENT = M
F17 = A
FAM = M
FBG = 0
FBL = 1
FBM = 0
FBT = 3
FBU = 2
FP = 1
FPL = 0
FPM = 1
FPU = 0
GAR = 2
HBG = 0
HBL = 0
HBM = 1
HBT = 1
HBU = 0
HOD = 0
KES = M
LRM = M
LT = 16
MBD = U
MOR = 0
NC = U
POC = SEA
PRJ = Cottage Grove # 3
PTO = Y
TQBT = 0
RRM = L
SAP = 0
SFF = 0
SFS = Per Builder Plans
SFU = 0
SML = Y
SNR = N
STY = 18
SWC = SEA
TBG = 0
TBL = 0
TBM = 0
TBU = 0
TX = 0
TXY = 0
UTR = U
WAC = SEA
APS = A|D|E|F|G
BDI = E
BSM = A|B
ENS = B
EXT = J|E
FEA = A|D|F|G|J|M|P|T
FLS = J|A|G
FND = E|F
GR = C
HTC = B
LDE = H|J
LTV = E|F
POS = A
RF = C
SIT = G|H|M|Y|N
SWR = A
TRM = B|C
VEW = D|L
WAS = D
ZJD = A
ZNC = SF 5000
ProhibitBLOG = Y
AllowAVM = Y
PARQ = N
BREO = N
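
For the INSERT/UPDATE step mentioned at the top of this post, the name/value pairs map naturally onto a parameterized statement. The following is only a sketch in Python with SQLite standing in for whatever SQL database is actually used. The table name `listings`, the column subset in `COLUMNS`, and the helper `save_listing` are all hypothetical, and `INSERT OR REPLACE` is SQLite-specific syntax.

```python
import sqlite3

# Hypothetical subset of columns; a real table would cover every tag needed.
COLUMNS = ("LN", "PTYP", "CIT", "LP")


def save_listing(conn, pairs):
    """Insert the listing, or replace the existing row with the same LN."""
    fields = dict(pairs)
    row = [fields.get(col) for col in COLUMNS]  # missing tags become NULL
    placeholders = ", ".join("?" for _ in COLUMNS)
    conn.execute(
        "INSERT OR REPLACE INTO listings (%s) VALUES (%s)"
        % (", ".join(COLUMNS), placeholders),
        row,
    )


conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE listings (LN TEXT PRIMARY KEY, PTYP TEXT, CIT TEXT, LP TEXT)"
)

save_listing(conn, [("LN", "62981"), ("PTYP", "RESI"),
                    ("CIT", "Seattle"), ("LP", "599950.00")])
# A later poll with a changed price replaces the row instead of duplicating it.
save_listing(conn, [("LN", "62981"), ("PTYP", "RESI"),
                    ("CIT", "Seattle"), ("LP", "589950.00")])

print(conn.execute("SELECT LN, CIT, LP FROM listings").fetchall())
```

The values are passed as bound parameters rather than pasted into the SQL text, which matters for free-text fields such as MR and DD. Note that a replace rewrites the whole row, so any column missing from a later poll ends up NULL.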
« Last Edit: December 01, 2010, 09:37:55 PM by JRS »

JRS

  • Guest
Re: XML Parsing
« Reply #1 on: December 01, 2010, 11:07:56 AM »
I gave my XML parsing routine a stress test by converting a 10 MB XML file, and it worked great. The array SB created with SPLITA must have had a sizable element count: the run returned 432,485 populated data fields (column name plus data). I poll the web service every three hours, so I will have nowhere near the amount of XML data I used for this test. Armando wasn't kidding when he said text processing and parsing is one of SB's strong points.
« Last Edit: December 01, 2010, 11:45:54 AM by JRS »

JRS

  • Guest
Re: XML Parsing
« Reply #2 on: December 03, 2010, 11:36:10 PM »
I ran across this thread about parsing XML web service responses and had to chuckle. Here is what my tinyXML parser returned.

Code: [Select]
Latitude = 32.9659843
Longitude = 96.74525
AllocationFactor = 0.002192
FipsCode = 48
PlaceName = RICHARDSON
StateCode = TX
Day = Sunday, June 07, 2009
WeatherImage = url
MaxTemperatureF = 94
MinTemperatureF = 74
MaxTemperatureC = 34
MinTemperatureC = 23
Day = Monday, June 08, 2009
WeatherImage = url
MaxTemperatureF = 94
MinTemperatureF = 74
MaxTemperatureC = 34
MinTemperatureC = 23
Day = Tuesday, June 09, 2009
WeatherImage = url
MaxTemperatureF = 95
MinTemperatureF = 76
MaxTemperatureC = 35
MinTemperatureC = 24
Day = Wednesday, June 10, 2009
WeatherImage = url
MaxTemperatureF = 93
MinTemperatureF = 74
MaxTemperatureC = 34
MinTemperatureC = 23
Day = Thursday, June 11, 2009
WeatherImage = url
MaxTemperatureF = 93
MinTemperatureF = 73
MaxTemperatureC = 34
MinTemperatureC = 23
Day = Friday, June 12, 2009
WeatherImage = url
MaxTemperatureF = 93
MinTemperatureF = 73
MaxTemperatureC = 34
MinTemperatureC = 23
Day = Saturday, June 13, 2009
WeatherImage = url
MaxTemperatureF = 94
MinTemperatureF = 73
MaxTemperatureC = 34
MinTemperatureC = 23




AIR

  • Guest
Re: XML Parsing
« Reply #3 on: December 12, 2010, 12:26:00 AM »
MBC response to your original post:

Code: [Select]
$execon

dim s$*8192,i,dest$[8192]
s$ = LoadFile$("residential.xml")
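' One array element per line; skip the leading container tags and the trailing close tags
for i = 2 to split(dest$,s$,lf$)-3
  ' Keep everything before the close tag (empty tags have none, so they become "")
  dest$[i]= left$(dest$[i],instr(dest$[i],"</")-1)
  ' Turn <LN>62981 into LN = 62981
  replace ">" with " = " in dest$[i]
  remove "<" from dest$[i]
  if len(dest$[i]) then print trim$(dest$[i])
next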

AIR

  • Guest
Re: XML Parsing
« Reply #4 on: December 12, 2010, 02:03:58 PM »
And here is a version using Mini-XML.

This post should probably be moved elsewhere, but I just wanted to show an alternative approach that can provide much more flexibility should the need arise....

FWIW, the binary comes in at 5K optimized under Linux (dynamically linked) and 46K optimized (statically linked) under OSX.

Code: [Select]
$execon "-lmxml"
$nomain

$HEADER
#include <mxml.h>
typedef mxml_node_t* XMLROOT;
typedef XMLROOT XMLNODE;
typedef mxml_index_t* XMLINDEX;
$HEADER

FUNCTION main(argc as INTEGER, argv as PCHAR PTR) as INTEGER
  dim tree as XMLROOT, node as XMLNODE, sub_node as XMLNODE,ind as XMLINDEX
  dim category$, value$
 
  OPEN "residential.xml" FOR INPUT AS xmlFile
  tree = mxmlLoadFile(NULL, xmlFile, MXML_OPAQUE_CALLBACK)
  CLOSE xmlFile
 
  ind = mxmlIndexNew(tree, NULL,NULL)
  mxmlIndexReset(ind)
 
  for integer cnt = 0 to ind->num_nodes
    node = mxmlIndexEnum(ind)
    if node = NULL then iterate
    sub_node = mxmlWalkNext(node, tree,MXML_DESCEND_FIRST)
    category$ = trim$(node->value.opaque$)
    value$ = trim$(sub_node->value.opaque$)
    if value$ != NUL$ then print category$;" = ";value$
  next

  mxmlIndexDelete(ind)
  mxmlDelete(tree)   
END FUNCTION

Quote from: OUTPUT
APS = A|D|E|F|G
AR = 140
ARC = K
ASF = 3650
AVDT = 1800-01-01 00:00:00
AllowAVM = Y
BDI = E
BDL = 2
BDM = 0
BDU = 3
BLD = JDR Development Inc
BLK = 14
BR = 5.00
BREO = N
BSM = A|B
BTH = 3.50
BUS = Y
CDOM = 0
CIT = Seattle
CLA = 0
CLO = 1800-01-01 00:00:00
COLO = 0
COU = King
CTDT = 2010-04-24 00:00:00
DD = From Delridge Way head east on Oregon which becomes 23rd.
DNO = L
DRM = M
DRS = SW
DSR = Pigeon Point
DSRNUM = 7215
ENS = B
ENT = M
EXT = J|E
F17 = A
FAM = M
FBG = 0
FBL = 1
FBM = 0
FBT = 3
FBU = 2
FEA = A|D|F|G|J|M|P|T
FLS = J|A|G
FND = E|F
FP = 1
FPL = 0
FPM = 1
FPU = 0
GAR = 2
GR = C
GRDX = G
GRDY = 4
HBG = 0
HBL = 0
HBM = 1
HBT = 1
HBU = 0
HOD = 0
HSN = 4538
HTC = B
INDT = 1800-01-01 00:00:00
KES = M
LAG = 27022
LAT = 47.561975
LD = 2010-04-24 00:00:00
LDE = H|J
LDR = 2010-04-24 00:00:00
LN = 62981
LO = 1401
LONG = -122.362210
LP = 599950.00
LRM = M
LSF = 4800
LT = 16
LTV = E|F
MAP = 594
MAPBOOK = THOM
MBD = U
MOR = 0
MR = Third of three New Contemporary Homes w/fantastic open floor plans and great level and fenced backyards.These homes have wonderful tall ceilings,designer paint,fully wrapped windows,solid core/glass int doors and top of the line strand Bamboo flrs.The kitchen is an entertainers dream w/an enormous open eating bar,honed granite counters,custom wood cabinets,top of the line stainless steel appls and french doors to the ent backyard. Quality and Designer features from top to Bottom, a must see!!
NC = U
NIA = Y
OLP = 599950.00
PARQ = N
PDR = 1800-01-01 00:00:00
PIC = 1
POC = SEA
POS = A
PRJ = Cottage Grove # 3
PTO = Y
PTYP = RESI
ProhibitBLOG = Y
QBT = 0
RF = C
RRM = L
SAG = 0
SAP = 0
SCA = 0
SCO = 0
SD = SEA
SDT = 2010-04-24 00:00:00
SFF = 0
SFS = Per Builder Plans
SFU = 0
SHOADR = Y
SIT = G|H|M|Y|N
SML = Y
SNR = N
SO = 0
SP = 0.00
SSUF = Ave
ST = CT
STA = WA
STR = 23rd
STY = 18
SWC = SEA
SWR = A
TAX = 1773600264
TBG = 0
TBL = 0
TBM = 0
TBU = 0
TQBT = 0
TRM = B|C
TX = 0
TXY = 0
UD = 2010-04-24 14:59:25
UTR = U
VEW = D|L
WAC = SEA
WAS = D
YBT = 2010
ZIP = 98126
ZJD = A
ZNC = SF 5000

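With both the element and attribute arguments passed as NULL, mxmlIndexNew() builds an index sorted by element name, which is why the output above is alphabetical rather than in document order.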
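
For comparison, the same library-based approach can be sketched in Python with the standard `xml.etree.ElementTree` module. This is not a translation of the MBC code, only an illustration of parsing into a tree and then listing the populated elements sorted by name. The helper names `local_name` and `sorted_fields` and the shortened sample are assumptions made for the example.

```python
import xml.etree.ElementTree as ET


def local_name(tag):
    """Strip the '{namespace}' prefix ElementTree puts on qualified tags."""
    return tag.rsplit("}", 1)[-1]


def sorted_fields(xml_text):
    """Return (name, value) pairs for populated elements, sorted by name."""
    root = ET.fromstring(xml_text)
    pairs = []
    for elem in root.iter():
        text = (elem.text or "").strip()
        if text:  # containers and empty tags have no usable text
            pairs.append((local_name(elem.tag), text))
    return sorted(pairs)


# Shortened, hypothetical sample using the namespace from the listing.
sample = """<Listings xmlns="http://www.nwmls.com/Schemas/Standard/StandardXML1_1.xsd">
  <Residential>
    <LN>62981</LN>
    <DRP/>
    <CIT>Seattle</CIT>
    <AR>140</AR>
  </Residential>
</Listings>"""

for name, value in sorted_fields(sample):
    print(name, "=", value)
```

Because a real parser is doing the work, entities, attributes, and the default namespace are handled for you; the cost is that the whole document is held in memory as a tree.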

A.

JRS

  • Guest
Re: XML Parsing
« Reply #5 on: December 12, 2010, 02:41:07 PM »
I have a HUGE need for a reliable XML parser (data extractor) for Linux. I remember you created a ScriptBasic extension module for mini-XML to replace the GNOME libxml2 module that seemed buggy. Did you release the source to your SB version of mini-XML? I would like to give it a try under Ubuntu 32 if at all possible.

I noticed your first attempt at a brute force XML parser but the Achilles Heel of that approach with BCX is that you have to know how big to DIM your work array. Under SB, array allocation is dynamic and the array that SPLITA created for the 10 MB XML I tried my tinyXML parser on must have over a million elements. As you say, I need a bit more control than just a utility that acts as a filter stripping XML tags top down.
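
To illustrate the kind of control meant here (one record at a time, with no work array to size up front), this is a rough Python sketch using the standard library's incremental parser. It assumes each listing sits in its own `Residential` element as in the sample; the function name `iter_listings` and the two-record test document are invented for the example.

```python
import io
import xml.etree.ElementTree as ET


def iter_listings(source, record_tag="Residential"):
    """Yield one dict per record element, releasing each record after use."""
    for event, elem in ET.iterparse(source, events=("end",)):
        # Ignore the namespace prefix and wait for a complete record element.
        if elem.tag.rsplit("}", 1)[-1] != record_tag:
            continue
        record = {}
        for child in elem:
            text = (child.text or "").strip()
            if text:  # skip empty fields such as <UNT/>
                record[child.tag.rsplit("}", 1)[-1]] = text
        yield record
        elem.clear()  # drop the children so parsed records do not pile up


# Hypothetical two-record document; a file name or open file works as well.
sample = """<Listings xmlns="http://www.nwmls.com/Schemas/Standard/StandardXML1_1.xsd">
  <Residential><LN>1</LN><CIT>Seattle</CIT><UNT/></Residential>
  <Residential><LN>2</LN><CIT>Tacoma</CIT></Residential>
</Listings>"""

for listing in iter_listings(io.BytesIO(sample.encode("utf-8"))):
    print(listing)
```

Each yielded dict corresponds to one listing, so record boundaries survive, which the flat tag-stripping filters in this thread do not preserve.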


AIR

  • Guest
Re: XML Parsing
« Reply #6 on: December 12, 2010, 04:03:03 PM »
Quote
I have a HUGE need for a reliable XML parser (data extractor) for Linux. I remember you created a ScriptBasic extension module for mini-XML to replace the GNOME libxml2 module that seemed buggy. Did you release the source to your SB version of mini-XML? I would like to give it a try under Ubuntu 32 if at all possible.

It's in the source distribution.

Quote
I noticed your first attempt at a brute force XML parser but the Achilles Heel of that approach with BCX is that you have to know how big to DIM your work array. Under SB, array allocation is dynamic and the array that SPLITA created for the 10 MB XML I tried my tinyXML parser on must have over a million elements. As you say, I need a bit more control than just a utility that acts as a filter stripping XML tags top down.

Well that's mostly true if you're using MBC in pure C mode, but if you use it in CPP/WX/GLIB/GTK mode, you can utilize the dynamic strings that each of those options offer.

Or, you could always get the file size, and redimension a C string.....

A.

JRS

  • Guest
Re: XML Parsing
« Reply #7 on: December 12, 2010, 04:10:11 PM »
Quote
you can utilize the dynamic strings

My vote is for MBC to have dynamic string support by default. Most Basic languages have dynamic string support and garbage collection built in. As feature rich as BCX is, not having this as a default is puzzling.


AIR

  • Guest
Re: XML Parsing
« Reply #8 on: December 12, 2010, 06:56:04 PM »
I've been tinkering with libgc for garbage collection; the problem is that I would have to alter quite a bit of code to take advantage of it.  The good thing is that its malloc function works like calloc, so no more having to memset objects.

As far as truly dynamic strings go, I have a few thoughts on how to implement them, but again I would have to alter quite a bit of code to do it.

The upshot is that dynamic strings won't happen until I have garbage collection in place.  Not having garbage collection would lead to a bunch of potential memory leaks, at the very least, or a LOT of support code being added to avoid leaks, which I'm not prepared to do.

At any rate, this might not happen soon, but it's on my to-do list....

A.

AIR

  • Guest
Re: XML Parsing
« Reply #9 on: December 13, 2010, 11:14:38 AM »
Quote from: AIR
Or, you could always get the file size, and redimension a C string.....

Like this, for example:
Code: [Select]
$execon

CONST FName = "residential.xml"

dim s as PCHAR, i, dest$[MAX_PATH]

redim s * LOF(FName)+1
s$ = LoadFile$(FName)

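' One array element per line; skip the leading container tags and the trailing close tags
for i = 2 to split(dest$,s$,lf$)-3
  ' Keep everything before the close tag (empty tags have none, so they become "")
  dest$[i]= left$(dest$[i],instr(dest$[i],"</")-1)
  ' Turn <LN>62981 into LN = 62981
  replace ">" with " = " in dest$[i]
  remove "<" from dest$[i]
  if len(dest$[i]) then print trim$(dest$[i])
next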

free(s)