{"id":579,"date":"2014-10-09T21:33:47","date_gmt":"2014-10-09T16:03:47","guid":{"rendered":"http:\/\/codeforgeek.com\/?p=579"},"modified":"2021-06-19T13:10:17","modified_gmt":"2021-06-19T07:40:17","slug":"parse-large-xml-files-node","status":"publish","type":"post","link":"https:\/\/codeforgeek.com\/parse-large-xml-files-node\/","title":{"rendered":"Parse large XML files in Node"},"content":{"rendered":"<p>Parsing large XML files (more than 500MB) can be very tedious if you are using <a href=\"https:\/\/codeforgeek.com\/facebook-login-using-nodejs-express\/\" title=\"Facebook login using nodejs and express\" target=\"_blank\" rel=\"noopener\">Node.js<\/a>. Many parsers out there cannot handle XML files this large and fail with this error:<\/p>\n<blockquote><p>FATAL ERROR JS Allocation failed &#8211; process out of memory<\/p><\/blockquote>\n<p>The SAX XML parser handles large XML files, but because of the complexity of handling events to capture specific XML nodes and data, we do not recommend this package either.<\/p>\n<blockquote><p>This code was tested on Ubuntu 14.04. Because of dependency issues, it may not run on Windows.<\/p><\/blockquote>\n<h2>What&#8217;s our Requirement?<\/h2>\n<p>We wanted an XML parser that can parse large XML files (ours is 635 megabytes) and lets us either convert the data into JSON format for further use or extract only the data we want and traverse it easily.<\/p>\n<h2>xml-stream Parser:<\/h2>\n<p>After testing almost every highly reputed parser (reputation in terms of daily downloads), we found this <a href=\"https:\/\/www.npmjs.com\/package\/xml-stream\" title=\"xml-stream \" target=\"_blank\" rel=\"noopener\">awesome <\/a>parser, which works exactly the way we required.<\/p>\n<p>Install it in your project using the following command: <code>npm install xml-stream<\/code><\/p>\n<h2>How to use xml-stream:<\/h2>\n<p>xml-stream is simple and fast. 
To use xml-stream, require it in your project and pass it a ReadStream object. Here is how to initialize it:<br \/>\n<code lang=\"javascript\"><br \/>\nvar fs        = require('fs');<br \/>\nvar XmlStream = require('xml-stream');<br \/>\n\/*<br \/>\n   * Pass the ReadStream object to xml-stream<br \/>\n*\/<br \/>\nvar stream = fs.createReadStream('file_name.xml');<br \/>\nvar xml = new XmlStream(stream);<br \/>\n\/*<br \/>\n  * Further code.<br \/>\n*\/<br \/>\n<\/code><\/p>\n<h2>How it works!<\/h2>\n<p>xml-stream parses the XML content and outputs each element as a JavaScript object. Here is an example.<\/p>\n<h4>Input XML<\/h4>\n<p><code lang=\"xml\"><br \/>\n<item id=\"123\" type=\"common\"><br \/>\n  <title>Item Title<\/title><br \/>\n  <description>Description of this item.<\/description><br \/>\n  (text)<br \/>\n<\/item><br \/>\n<\/code><\/p>\n<h4>Parser Output:<\/h4>\n<p><code lang=\"javascript\"><br \/>\n{<br \/>\n  title: 'Item Title',<br \/>\n  description: 'Description of this item.',<br \/>\n  '$': {<br \/>\n    'id': '123',<br \/>\n    'type': 'common'<br \/>\n  },<br \/>\n  '$name': 'item',<br \/>\n  '$text': '(text)'<br \/>\n}<br \/>\n<\/code><\/p>\n<h2>Extract a specific XML node:<\/h2>\n<p>Here comes the interesting part. Suppose you have a large XML file, like I have, and you want to extract only the information enclosed in a specific XML node. <strong>xml-stream<\/strong> provides the <strong>&#8216;preserve&#8217;<\/strong> and <strong>&#8216;collect&#8217;<\/strong> functions to do so. 
See the example.<\/p>\n<h4>XML file content<\/h4>\n<p><code lang=\"xml\"><br \/>\n<?xml version=\"1.0\" encoding=\"UTF-8\"?><br \/>\n<media mediaId=\"value\" lastModified=\"date\" action=\"add\"><br \/>\n<title size=\"140\" type=\"full\" lang=\"en\">Some title<\/title><br \/>\n<ids><br \/>\n<id type=\"rootId\">10000020<\/id><br \/>\n<id type=\"seriesId\">10000020<\/id><br \/>\n<id type=\"TMSId\">SH017461480000<\/id><br \/>\n<\/ids><br \/>\n<image type=\"image\/jpg\" width=\"270\" height=\"360\" primary=\"true\" category=\"Banner\"><br \/>\n<URI>Some URL<\/URI><br \/>\n<caption lang=\"en\">Some title<\/caption><br \/>\n<\/image><br \/>\n<\/media><br \/>\n<media mediaId=\"p10000020_b_v4_aa\" lastModified=\"2013-06-14T00:00:00Z\" action=\"add\"><br \/>\n<title size=\"140\" type=\"full\" lang=\"en\">Some title<\/title><br \/>\n<ids><br \/>\n<id type=\"rootId\">10000020<\/id><br \/>\n<id type=\"seriesId\">10000020<\/id><br \/>\n<id type=\"TMSId\">SH017461480000<\/id><br \/>\n<\/ids><br \/>\n<image type=\"image\/jpg\" width=\"540\" height=\"720\" primary=\"true\" category=\"Banner\"><br \/>\n<URI>Some URL<\/URI><br \/>\n<caption lang=\"en\">Some title<\/caption><br \/>\n<\/image><br \/>\n<\/media><br \/>\n<\/xml><br \/>\n<\/code><br \/>\nNow I want to extract only the values of &lt;id&gt; and print them. 
Here is the code to do so.<br \/>\n<code lang=\"javascript\"><br \/>\nvar fs        = require('fs');<br \/>\nvar XmlStream = require('xml-stream');<br \/>\nvar stream = fs.createReadStream('tvbanners.xml');<br \/>\nvar xml = new XmlStream(stream);<br \/>\n\/\/ Keep the children of each id element in '$children'<br \/>\nxml.preserve('id', true);<br \/>\nxml.collect('subitem');<br \/>\n\/\/ Fires each time a closing id tag is read<br \/>\nxml.on('endElement: id', function(item) {<br \/>\n  console.log(item);<br \/>\n});<br \/>\n<\/code><\/p>\n<h4>Parser Output:<\/h4>\n<p>I ran the script and redirected its output to a text file with this command:<br \/>\n<code>node server.js > output.txt<\/code><br \/>\nHere is the content of my output file.<br \/>\n<code lang=\"javascript\"><br \/>\n{ '$children': [ '10000020' ],<br \/>\n  '$': { type: 'rootId' },<br \/>\n  '$text': '10000020',<br \/>\n  '$name': 'id' }<br \/>\n{ '$children': [ '10000020' ],<br \/>\n  '$': { type: 'seriesId' },<br \/>\n  '$text': '10000020',<br \/>\n  '$name': 'id' }<br \/>\n{ '$children': [ 'SH017461480000' ],<br \/>\n  '$': { type: 'TMSId' },<br \/>\n  '$text': 'SH017461480000',<br \/>\n  '$name': 'id' }<br \/>\n   .<br \/>\n   .<br \/>\n   ....more content<br \/>\n<\/code><br \/>\nTo print only a specific part of a node, use <code lang=\"javascript\">console.log(item['$text']);<\/code> for its text, or <code lang=\"javascript\">console.log(item['$']['type']);<\/code> to read an attribute from the nested '$' object.<\/p>\n<p>That is it for now. If you have any questions, ask them in the comments.<\/p>\n","protected":false},"excerpt":{"rendered":"<p>Parsing large xml files ( more than 500MB ) seems to be very tedious if you are using Node.js. 
Many parser out there do not handle large size xml files and throw this error FATAL ERROR JS Allocation failed &#8211; process out of memory SAX xml parser handles large xml files but due to it&#8217;s [&hellip;]<\/p>\n","protected":false},"author":1,"featured_media":591,"comment_status":"closed","ping_status":"closed","sticky":false,"template":"","format":"standard","meta":{"_surecart_dashboard_logo_width":"180px","_surecart_dashboard_show_logo":true,"_surecart_dashboard_navigation_orders":true,"_surecart_dashboard_navigation_invoices":true,"_surecart_dashboard_navigation_subscriptions":true,"_surecart_dashboard_navigation_downloads":true,"_surecart_dashboard_navigation_billing":true,"_surecart_dashboard_navigation_account":true,"_uag_custom_page_level_css":"","footnotes":""},"categories":[14,18],"tags":[],"class_list":["post-579","post","type-post","status-publish","format-standard","has-post-thumbnail","hentry","category-nodejs","category-tutorial"],"blocksy_meta":[],"uagb_featured_image_src":{"full":["https:\/\/codeforgeek.com\/wp-content\/uploads\/2014\/10\/banner.png",876,288,false],"thumbnail":["https:\/\/codeforgeek.com\/wp-content\/uploads\/2014\/10\/banner-150x150.png",150,150,true],"medium":["https:\/\/codeforgeek.com\/wp-content\/uploads\/2014\/10\/banner-300x99.png",300,99,true],"medium_large":["https:\/\/codeforgeek.com\/wp-content\/uploads\/2014\/10\/banner-768x252.png",768,252,true],"large":["https:\/\/codeforgeek.com\/wp-content\/uploads\/2014\/10\/banner.png",876,288,false],"1536x1536":["https:\/\/codeforgeek.com\/wp-content\/uploads\/2014\/10\/banner.png",876,288,false],"2048x2048":["https:\/\/codeforgeek.com\/wp-content\/uploads\/2014\/10\/banner.png",876,288,false]},"uagb_author_info":{"display_name":"Shahid","author_link":"https:\/\/codeforgeek.com\/author\/shahid\/"},"uagb_comment_info":0,"uagb_excerpt":"Parsing large xml files ( more than 500MB ) seems to be very tedious if you are using Node.js. 
Many parser out there do not handle large size xml files and throw this error FATAL ERROR JS Allocation failed &#8211; process out of memory SAX xml parser handles large xml files but due to it&#8217;s&hellip;","_links":{"self":[{"href":"https:\/\/codeforgeek.com\/wp-json\/wp\/v2\/posts\/579","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/codeforgeek.com\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/codeforgeek.com\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/codeforgeek.com\/wp-json\/wp\/v2\/users\/1"}],"replies":[{"embeddable":true,"href":"https:\/\/codeforgeek.com\/wp-json\/wp\/v2\/comments?post=579"}],"version-history":[{"count":0,"href":"https:\/\/codeforgeek.com\/wp-json\/wp\/v2\/posts\/579\/revisions"}],"wp:featuredmedia":[{"embeddable":true,"href":"https:\/\/codeforgeek.com\/wp-json\/wp\/v2\/media\/591"}],"wp:attachment":[{"href":"https:\/\/codeforgeek.com\/wp-json\/wp\/v2\/media?parent=579"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/codeforgeek.com\/wp-json\/wp\/v2\/categories?post=579"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/codeforgeek.com\/wp-json\/wp\/v2\/tags?post=579"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}