{"id":7427,"date":"2022-08-03T15:04:09","date_gmt":"2022-08-03T09:34:09","guid":{"rendered":"https:\/\/www.intelligencenode.com\/blog\/?p=7427"},"modified":"2026-03-02T12:20:53","modified_gmt":"2026-03-02T06:50:53","slug":"digital-commerce-data-mining-data-acquisition-in-retail","status":"publish","type":"post","link":"https:\/\/www.intelligencenode.com\/blog\/digital-commerce-data-mining-data-acquisition-in-retail\/","title":{"rendered":"Digital Commerce Data Mining &#8211; Data Acquisition in Retail"},"content":{"rendered":"\n<pre class=\"wp-block-preformatted\"><em>We are starting a new series on the practical applications of data science in retail called, \"Digital Commerce Data Mining\".&nbsp;The first article in the series is 'Data Acquisition in Retail - Adaptive Data Collection'. Data acquisition at a large scale and at affordable costs is not possible manually. It is a rigorous process and it comes with its own challenges. To address these challenges, Intelligence Node\u2019s&nbsp;analytics and data science team has developed strategies through advanced analytics and continuous R&amp;D, which we will be discussing at length in this article.<\/em><\/pre>\n\n\n\n<p><strong>An expert outlook on practical data science use cases in retail<\/strong><\/p>\n\n\n\n<h2 class=\"wp-block-heading\" id=\"h-introduction\"><strong>Introduction<\/strong><\/h2>\n\n\n\n<p>Intelligence Node has to crawl millions of web pages daily to provide its customers with real-time, high-velocity, and accurate data. But data acquisition at such a large scale and at affordable costs is not possible manually. It is a rigorous process and it comes with its own challenges. 
To address these challenges, Intelligence Node\u2019s analytics and data science team has developed strategies through advanced analytics and continuous R&amp;D.<\/p>\n\n\n\n<p>In this part of the \u2018Alpha Capture in Digital Commerce\u2019 series, we will explore the data acquisition challenges in retail and discuss data science applications to solve these challenges.<\/p>\n\n\n\n<h3 class=\"wp-block-heading\" id=\"h-adaptive-crawling-for-data-acquisition\"><strong>Adaptive Crawling for Data Acquisition<\/strong><\/h3>\n\n\n\n<figure class=\"wp-block-image size-full\"><img loading=\"lazy\" decoding=\"async\" width=\"895\" height=\"500\" src=\"https:\/\/www.intelligencenode.com\/blog\/wp-content\/uploads\/2022\/08\/adaptive-crawling.png\" alt=\"Adaptive Crawling\" class=\"wp-image-13297\" srcset=\"https:\/\/www.intelligencenode.com\/blog\/wp-content\/uploads\/2022\/08\/adaptive-crawling.png 895w, https:\/\/www.intelligencenode.com\/blog\/wp-content\/uploads\/2022\/08\/adaptive-crawling-300x168.png 300w, https:\/\/www.intelligencenode.com\/blog\/wp-content\/uploads\/2022\/08\/adaptive-crawling-768x429.png 768w\" sizes=\"auto, (max-width: 895px) 100vw, 895px\" \/><\/figure>\n\n\n\n<p>Adaptive crawling consists of two components: the smart proxy and parsing.<\/p>\n\n\n\n<h2 class=\"wp-block-heading\" id=\"h-the-elegant-middleware-smart-proxy\"><strong>The elegant middleware: Smart proxy<\/strong><\/h2>\n\n\n\n<p>Intelligence Node\u2019s team of data scientists has worked on developing intelligent, automated strategies to overcome crawling challenges such as high costs, labor intensiveness, and low success rates. For each target, the smart proxy:<\/p>\n\n\n\n<ul class=\"wp-block-list\">\n<li>Builds a recipe (plan) from the available strategies<\/li>\n\n\n\n<li>Tries to minimize the recipe based on:\n<ul class=\"wp-block-list\">\n<li>Price<\/li>\n\n\n\n<li>Success rate<\/li>\n\n\n\n<li>Speed<\/li>\n<\/ul>\n<\/li>\n<\/ul>\n\n\n\n<p><strong>Some of the strategies are:<\/strong><\/p>\n\n\n\n<ul class=\"wp-block-list\">\n<li>Selection of a 
certain IP address pool<\/li>\n\n\n\n<li>Using mobile\/residential IPs<\/li>\n\n\n\n<li>Using different user agents<\/li>\n\n\n\n<li>Using a&nbsp;<strong>custom<\/strong>-developed browser (cluster)<\/li>\n\n\n\n<li>Sending special headers\/cookies<\/li>\n\n\n\n<li>Using anti-blocker (anti-PerimeterX) strategies<\/li>\n<\/ul>\n\n\n\n<h2 class=\"wp-block-heading\" id=\"h-the-heavy-lifting-parsing\"><strong>The heavy lifting: Parsing<\/strong><\/h2>\n\n\n\n<h3 class=\"wp-block-heading\" id=\"h-auto-parsing-nbsp\">Auto Parsing<\/h3>\n\n\n\n<ul class=\"wp-block-list\">\n<li>The data acquisition team uses a custom-tuned transformer-encoder-based network (similar to BERT). This network converts web pages to text in order to retrieve the generic information available on product pages, such as price, title, description, and image URLs.<\/li>\n\n\n\n<li>The network is layout-aware and uses the CSS properties of elements to extract text representations of HTML without rendering it, unlike the Selenium-based extraction method.<\/li>\n\n\n\n<li>The network can extract information from nested tables and complex textual structures. This is possible because the model understands both language and the HTML DOM.<\/li>\n<\/ul>\n\n\n\n<h3 class=\"wp-block-heading\" id=\"h-visual-parsing\">Visual Parsing<\/h3>\n\n\n\n<p>Another way to extract information from web pages, PDFs, or screenshots is visual scraping. When crawling is not an option, the analytics and data science team uses a custom-built, visual, AI-based crawling solution.<\/p>\n\n\n\n<p>Details<\/p>\n\n\n\n<ul class=\"wp-block-list\">\n<li>For external sources where crawling is not permissible, the team uses a visual, AI-based crawling solution.<\/li>\n\n\n\n<li>The team uses object detection with a YOLO (CNN-based) architecture to precisely segment a product page into objects of interest. 
Examples include the title, price, information, and image areas.<\/li>\n\n\n\n<li>To get textual information from PDFs, images, and videos, the team attaches an OCR network at the end of this hybrid architecture.<\/li>\n<\/ul>\n\n\n\n<p id=\"h-example\">Example<\/p>\n\n\n\n<figure class=\"wp-block-image size-large\"><img loading=\"lazy\" decoding=\"async\" width=\"1024\" height=\"576\" src=\"https:\/\/www.intelligencenode.com\/blog\/wp-content\/uploads\/2022\/12\/1655238140203-1024x576.png\" alt=\"Object detection\" class=\"wp-image-7833\" srcset=\"https:\/\/www.intelligencenode.com\/blog\/wp-content\/uploads\/2022\/12\/1655238140203-1024x576.png 1024w, https:\/\/www.intelligencenode.com\/blog\/wp-content\/uploads\/2022\/12\/1655238140203-300x169.png 300w, https:\/\/www.intelligencenode.com\/blog\/wp-content\/uploads\/2022\/12\/1655238140203-768x432.png 768w, https:\/\/www.intelligencenode.com\/blog\/wp-content\/uploads\/2022\/12\/1655238140203.png 1280w\" sizes=\"auto, (max-width: 1024px) 100vw, 1024px\" \/><\/figure>\n\n\n\n<h2 class=\"wp-block-heading\" id=\"h-tech-stack\"><strong>Tech Stack<\/strong><\/h2>\n\n\n\n<p>The team uses the following tech stack to build the anti-blocker technology widely used by Intelligence Node:<\/p>\n\n\n\n<p><strong>Linux (Ubuntu)<\/strong>, a default choice for servers, acts as our base OS, helping us deploy our applications. We use&nbsp;<strong>Python<\/strong>&nbsp;to develop our ML models because most ML libraries support it and it is easy to use.&nbsp;<strong>PyTorch<\/strong>,&nbsp;an open-source machine learning framework based on the Torch library, is our preferred choice from research prototyping through model building and training. Although it is similar to TensorFlow, we have found PyTorch faster and useful when developing models from scratch. We use&nbsp;<strong>FastAPI<\/strong>&nbsp;for API endpoints and for maintenance and service. 
FastAPI is a web framework that makes the model accessible from anywhere.<\/p>\n\n\n\n<pre class=\"wp-block-preformatted\"><em>We Provide Sophisticated eCommerce Insights served via Scalable APIs, Custom Data Exports, &amp; SaaS Portal: <a href=\"https:\/\/www.intelligencenode.com\/products\/data-intelligence\/\">Learn More<\/a><\/em><\/pre>\n\n\n\n<p>We moved from Flask to FastAPI for its additional benefits: simple syntax, a very fast framework, asynchronous request handling, better query handling, and excellent documentation. Lastly,&nbsp;Docker,&nbsp;a containerization platform,&nbsp;allows us to bundle all of the above into a container that can be deployed easily across different platforms and environments.&nbsp;Kubernetes&nbsp;allows us to automatically orchestrate, scale, and manage these containerized applications so that load is handled on autopilot: when the load is heavy it scales up to handle the extra load, and when the load drops it scales back down.<\/p>\n\n\n\n<h2 class=\"wp-block-heading\" id=\"h-conclusion\"><strong>Conclusion<\/strong><\/h2>\n\n\n\n<p>In the digital age of retail, giants like Amazon are leveraging advanced data analytics and <a href=\"https:\/\/www.intelligencenode.com\/blog\/are-price-recommendations-enough\/\">pricing engines<\/a> to review the prices of millions of products every few minutes. To compete with this level of sophistication and to offer competitive pricing, assortment, and personalized experiences to today\u2019s comparison shoppers, AI-driven data analytics is a must. Data acquisition through competitor website crawling has no alternative. As the retail industry becomes more real-time and more fiercely competitive, the velocity, variety, and volume of data will need to keep pace. 
Through these data acquisition innovations developed by the team,&nbsp;<a href=\"https:\/\/www.intelligencenode.com\/\" target=\"_blank\" rel=\"noreferrer noopener\">Intelligence Node<\/a>&nbsp;aims to constantly provide the most accurate and comprehensive data to its clients while also sharing its analytical abilities with data analytics enthusiasts everywhere.&nbsp;<\/p>\n\n\n\n<p><\/p>\n","protected":false},"excerpt":{"rendered":"<p>We are starting a new series on the practical applications of data science in retail called, &#8220;Digital Commerce Data Mining&#8221;.&nbsp;The first article in the series is &#8216;Data Acquisition in Retail&#8230;<\/p>\n","protected":false},"author":2,"featured_media":7436,"comment_status":"closed","ping_status":"closed","sticky":false,"template":"","format":"standard","meta":{"content-type":"","_jetpack_memberships_contains_paid_content":false,"footnotes":""},"categories":[4,44],"tags":[],"class_list":["post-7427","post","type-post","status-publish","format-standard","has-post-thumbnail","hentry","category-big-data","category-future-retail"],"yoast_head":"<!-- This site is optimized with the Yoast SEO Premium plugin v27.4 (Yoast SEO v27.5) - https:\/\/yoast.com\/product\/yoast-seo-premium-wordpress\/ -->\n<title>Adaptive Crawling for Data Acquisition<\/title>\n<meta name=\"description\" content=\"In this article, we explore the challenges of data acquisition in retail and discuss data science applications to solve these challenges.\" \/>\n<meta name=\"robots\" content=\"index, follow, max-snippet:-1, max-image-preview:large, max-video-preview:-1\" \/>\n<link rel=\"canonical\" href=\"https:\/\/www.intelligencenode.com\/blog\/digital-commerce-data-mining-data-acquisition-in-retail\/\" \/>\n<meta property=\"og:locale\" content=\"en_US\" \/>\n<meta property=\"og:type\" content=\"article\" \/>\n<meta property=\"og:title\" content=\"Digital Commerce Data Mining - Data Acquisition in Retail\" \/>\n<meta property=\"og:description\" 
content=\"In this article, we explore the challenges of data acquisition in retail and discuss data science applications to solve these challenges.\" \/>\n<meta property=\"og:url\" content=\"https:\/\/www.intelligencenode.com\/blog\/digital-commerce-data-mining-data-acquisition-in-retail\/\" \/>\n<meta property=\"og:site_name\" content=\"Blog\" \/>\n<meta property=\"article:publisher\" content=\"https:\/\/www.facebook.com\/intelligencenode\" \/>\n<meta property=\"article:published_time\" content=\"2022-08-03T09:34:09+00:00\" \/>\n<meta property=\"article:modified_time\" content=\"2026-03-02T06:50:53+00:00\" \/>\n<meta property=\"og:image\" content=\"https:\/\/www.intelligencenode.com\/blog\/wp-content\/uploads\/2022\/08\/1655237634970-4.jpeg\" \/>\n\t<meta property=\"og:image:width\" content=\"1280\" \/>\n\t<meta property=\"og:image:height\" content=\"720\" \/>\n\t<meta property=\"og:image:type\" content=\"image\/jpeg\" \/>\n<meta name=\"author\" content=\"Intelligence Node\" \/>\n<meta name=\"twitter:card\" content=\"summary_large_image\" \/>\n<meta name=\"twitter:creator\" content=\"@bigdataNODE\" \/>\n<meta name=\"twitter:site\" content=\"@bigdataNODE\" \/>\n<meta name=\"twitter:label1\" content=\"Written by\" \/>\n\t<meta name=\"twitter:data1\" content=\"Intelligence Node\" \/>\n\t<meta name=\"twitter:label2\" content=\"Est. reading time\" \/>\n\t<meta name=\"twitter:data2\" content=\"4 minutes\" \/>\n<!-- \/ Yoast SEO Premium plugin. 
-->","yoast_head_json":{"title":"Adaptive Crawling for Data Acquisition","description":"In this article, we explore the challenges of data acquisition in retail and discuss data science applications to solve these challenges.","robots":{"index":"index","follow":"follow","max-snippet":"max-snippet:-1","max-image-preview":"max-image-preview:large","max-video-preview":"max-video-preview:-1"},"canonical":"https:\/\/www.intelligencenode.com\/blog\/digital-commerce-data-mining-data-acquisition-in-retail\/","og_locale":"en_US","og_type":"article","og_title":"Digital Commerce Data Mining - Data Acquisition in Retail","og_description":"In this article, we explore the challenges of data acquisition in retail and discuss data science applications to solve these challenges.","og_url":"https:\/\/www.intelligencenode.com\/blog\/digital-commerce-data-mining-data-acquisition-in-retail\/","og_site_name":"Blog","article_publisher":"https:\/\/www.facebook.com\/intelligencenode","article_published_time":"2022-08-03T09:34:09+00:00","article_modified_time":"2026-03-02T06:50:53+00:00","og_image":[{"width":1280,"height":720,"url":"https:\/\/www.intelligencenode.com\/blog\/wp-content\/uploads\/2022\/08\/1655237634970-4.jpeg","type":"image\/jpeg"}],"author":"Intelligence Node","twitter_card":"summary_large_image","twitter_creator":"@bigdataNODE","twitter_site":"@bigdataNODE","twitter_misc":{"Written by":"Intelligence Node","Est. 
reading time":"4 minutes"},"schema":{"@context":"https:\/\/schema.org","@graph":[{"@type":"Article","@id":"https:\/\/www.intelligencenode.com\/blog\/digital-commerce-data-mining-data-acquisition-in-retail\/#article","isPartOf":{"@id":"https:\/\/www.intelligencenode.com\/blog\/digital-commerce-data-mining-data-acquisition-in-retail\/"},"author":{"name":"Intelligence Node","@id":"https:\/\/www.intelligencenode.com\/blog\/#\/schema\/person\/dec96545f790884e8a05f794934695f1"},"headline":"Digital Commerce Data Mining &#8211; Data Acquisition in Retail","datePublished":"2022-08-03T09:34:09+00:00","dateModified":"2026-03-02T06:50:53+00:00","mainEntityOfPage":{"@id":"https:\/\/www.intelligencenode.com\/blog\/digital-commerce-data-mining-data-acquisition-in-retail\/"},"wordCount":774,"image":{"@id":"https:\/\/www.intelligencenode.com\/blog\/digital-commerce-data-mining-data-acquisition-in-retail\/#primaryimage"},"thumbnailUrl":"https:\/\/www.intelligencenode.com\/blog\/wp-content\/uploads\/2022\/08\/1655237634970-4.jpeg","articleSection":["Big Data","Future Retail"],"inLanguage":"en-US"},{"@type":"WebPage","@id":"https:\/\/www.intelligencenode.com\/blog\/digital-commerce-data-mining-data-acquisition-in-retail\/","url":"https:\/\/www.intelligencenode.com\/blog\/digital-commerce-data-mining-data-acquisition-in-retail\/","name":"Adaptive Crawling for Data 
Acquisition","isPartOf":{"@id":"https:\/\/www.intelligencenode.com\/blog\/#website"},"primaryImageOfPage":{"@id":"https:\/\/www.intelligencenode.com\/blog\/digital-commerce-data-mining-data-acquisition-in-retail\/#primaryimage"},"image":{"@id":"https:\/\/www.intelligencenode.com\/blog\/digital-commerce-data-mining-data-acquisition-in-retail\/#primaryimage"},"thumbnailUrl":"https:\/\/www.intelligencenode.com\/blog\/wp-content\/uploads\/2022\/08\/1655237634970-4.jpeg","datePublished":"2022-08-03T09:34:09+00:00","dateModified":"2026-03-02T06:50:53+00:00","author":{"@id":"https:\/\/www.intelligencenode.com\/blog\/#\/schema\/person\/dec96545f790884e8a05f794934695f1"},"description":"In this article, we explore the challenges of data acquisition in retail and discuss data science applications to solve these challenges.","breadcrumb":{"@id":"https:\/\/www.intelligencenode.com\/blog\/digital-commerce-data-mining-data-acquisition-in-retail\/#breadcrumb"},"inLanguage":"en-US","potentialAction":[{"@type":"ReadAction","target":["https:\/\/www.intelligencenode.com\/blog\/digital-commerce-data-mining-data-acquisition-in-retail\/"]}]},{"@type":"ImageObject","inLanguage":"en-US","@id":"https:\/\/www.intelligencenode.com\/blog\/digital-commerce-data-mining-data-acquisition-in-retail\/#primaryimage","url":"https:\/\/www.intelligencenode.com\/blog\/wp-content\/uploads\/2022\/08\/1655237634970-4.jpeg","contentUrl":"https:\/\/www.intelligencenode.com\/blog\/wp-content\/uploads\/2022\/08\/1655237634970-4.jpeg","width":1280,"height":720,"caption":"Data Acquisition in Retail data mining"},{"@type":"BreadcrumbList","@id":"https:\/\/www.intelligencenode.com\/blog\/digital-commerce-data-mining-data-acquisition-in-retail\/#breadcrumb","itemListElement":[{"@type":"ListItem","position":1,"name":"Home","item":"https:\/\/www.intelligencenode.com\/blog\/"},{"@type":"ListItem","position":2,"name":"Digital Commerce Data Mining &#8211; Data Acquisition in 
Retail"}]},{"@type":"WebSite","@id":"https:\/\/www.intelligencenode.com\/blog\/#website","url":"https:\/\/www.intelligencenode.com\/blog\/","name":"Blog","description":"Intelligence Node Blog - Tips to Maximize Ecommerce Growth","potentialAction":[{"@type":"SearchAction","target":{"@type":"EntryPoint","urlTemplate":"https:\/\/www.intelligencenode.com\/blog\/?s={search_term_string}"},"query-input":{"@type":"PropertyValueSpecification","valueRequired":true,"valueName":"search_term_string"}}],"inLanguage":"en-US"},{"@type":"Person","@id":"https:\/\/www.intelligencenode.com\/blog\/#\/schema\/person\/dec96545f790884e8a05f794934695f1","name":"Intelligence Node","image":{"@type":"ImageObject","inLanguage":"en-US","@id":"https:\/\/secure.gravatar.com\/avatar\/7480bd94ee02b87b4ebf5881cdc6b554b7bad668d9932aab5765809e15ab9a2d?s=96&d=mm&r=g","url":"https:\/\/secure.gravatar.com\/avatar\/7480bd94ee02b87b4ebf5881cdc6b554b7bad668d9932aab5765809e15ab9a2d?s=96&d=mm&r=g","contentUrl":"https:\/\/secure.gravatar.com\/avatar\/7480bd94ee02b87b4ebf5881cdc6b554b7bad668d9932aab5765809e15ab9a2d?s=96&d=mm&r=g","caption":"Intelligence 
Node"}}]}},"jetpack_featured_media_url":"https:\/\/www.intelligencenode.com\/blog\/wp-content\/uploads\/2022\/08\/1655237634970-4.jpeg","jetpack_sharing_enabled":true,"_links":{"self":[{"href":"https:\/\/www.intelligencenode.com\/blog\/wp-json\/wp\/v2\/posts\/7427","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/www.intelligencenode.com\/blog\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/www.intelligencenode.com\/blog\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/www.intelligencenode.com\/blog\/wp-json\/wp\/v2\/users\/2"}],"replies":[{"embeddable":true,"href":"https:\/\/www.intelligencenode.com\/blog\/wp-json\/wp\/v2\/comments?post=7427"}],"version-history":[{"count":22,"href":"https:\/\/www.intelligencenode.com\/blog\/wp-json\/wp\/v2\/posts\/7427\/revisions"}],"predecessor-version":[{"id":13626,"href":"https:\/\/www.intelligencenode.com\/blog\/wp-json\/wp\/v2\/posts\/7427\/revisions\/13626"}],"wp:featuredmedia":[{"embeddable":true,"href":"https:\/\/www.intelligencenode.com\/blog\/wp-json\/wp\/v2\/media\/7436"}],"wp:attachment":[{"href":"https:\/\/www.intelligencenode.com\/blog\/wp-json\/wp\/v2\/media?parent=7427"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/www.intelligencenode.com\/blog\/wp-json\/wp\/v2\/categories?post=7427"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/www.intelligencenode.com\/blog\/wp-json\/wp\/v2\/tags?post=7427"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}