Crawler – Crawling Through the Internet

        Hello All, Welcome back to TechSimplify 🙂 Today we will look at a very special, very interesting technology, or rather terminology, of the Internet, known as a Web Crawler, or simply Crawler. A Web Crawler, also known as a “Spider”, is an internet bot or automated script that browses (crawls) the World Wide Web, typically for the purpose of web indexing or web spidering. Web indexing refers to methods for indexing the contents of a website or of the Internet as a whole. Search engines like Google and Yahoo use web crawlers to update their records and the indices of other sites’ web content. One application of a web crawler is to keep track of all the pages it visits so that a search engine can later process and index them, allowing users to find the most relevant data on the internet more efficiently.

Web crawling (also loosely referred to as web scraping) is widely applied in many areas today. Web crawler tools are becoming well known to the general public, since they have simplified and automated the entire crawling process and made web data easily accessible to everyone. There are many web crawler tools that enable users to crawl the World Wide Web in a methodical and fast manner without coding, and to transform the data into various formats to suit their needs.

Crawlers consume resources on the systems they visit and often visit sites without approval. Since the number of pages on the internet is extremely large, even the largest crawlers fall short of making a complete index. For that reason, search engines struggled to return relevant search results in the early years of the World Wide Web, before the year 2000. Modern search engines have improved on this greatly; nowadays relevant results are returned almost instantly. Crawlers can also validate hyperlinks and HTML code, and they can be used for web scraping.

How Do Crawlers Work?


        A Web Crawler starts with a list of URLs to visit, known as seeds. As the crawler visits these URLs, it identifies all the hyperlinks on each page and adds them to the list of URLs to visit, called the crawl frontier. URLs from the crawl frontier are recursively visited according to a set of predefined policies. If the crawler is archiving websites, it copies and saves the information as it goes. The archives are usually stored in such a way that they can be viewed, read and navigated as they were on the live web, but they are preserved as “snapshots”. The archive is known as the repository and is designed to store and manage the collection of web pages. The repository only stores HTML pages, and these pages are stored as distinct files. A repository is similar to any other system that stores data, like a modern-day database. The only difference is that a repository does not need all the functionality offered by a database system. The repository stores the most recent version of each web page retrieved by the crawler.
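To make that loop concrete, here is a minimal sketch in Python (standard library only) of a crawler that starts from seed URLs, maintains a crawl frontier and a visited set, and keeps the latest snapshot of each page in a simple in-memory repository. The names (`LinkExtractor`, `crawl`) and the seed URL are illustrative, not part of any particular crawler; a real crawler adds politeness, error handling and persistence on top of this.

```python
# Minimal crawl loop: seed URLs, a crawl frontier, a "visited" set,
# and a repository keyed by URL (standard library only).
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urljoin, urldefrag
from urllib.request import urlopen


class LinkExtractor(HTMLParser):
    """Collects the href of every <a> tag on a page."""
    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)


def crawl(seeds, max_pages=50):
    frontier = deque(seeds)          # the crawl frontier
    visited = set()                  # URLs already fetched
    repository = {}                  # URL -> latest HTML snapshot

    while frontier and len(visited) < max_pages:
        url = frontier.popleft()
        if url in visited:
            continue
        try:
            html = urlopen(url, timeout=10).read().decode("utf-8", "replace")
        except Exception:
            continue                 # skip pages that fail to download
        visited.add(url)
        repository[url] = html       # keep the most recent version

        parser = LinkExtractor()
        parser.feed(html)
        for link in parser.links:
            absolute, _ = urldefrag(urljoin(url, link))  # resolve + drop #fragment
            if absolute.startswith("http") and absolute not in visited:
                frontier.append(absolute)

    return repository


if __name__ == "__main__":
    pages = crawl(["https://example.com/"])
    print(f"Fetched {len(pages)} pages")
```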


Architecture of Web Crawler

[Figure: High-level architecture of a Web crawler (source: wikipedia.org)]

The above figure shows the high-level architecture of a Web crawler. A more in-depth architecture for a Web crawler is as follows –

[Figure: Detailed architecture of a large-scale Web crawler]

The above figure depicts the typical architecture of a large-scale Web crawler. By a large-scale crawler we mean a system capable of gathering billions of documents from the current World Wide Web. It is clear that with such a huge amount of data, more sophisticated techniques must be applied than simply parsing HTML files and downloading documents from the URLs extracted from them. The key ideas are avoiding Web pages (URLs) already visited before, parallelizing crawling (fetching threads), balancing the load on the Web servers from which documents are obtained (server queues), and speeding up access to Web servers (via DNS caching).
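As a rough illustration of two of those ideas, the toy sketch below keeps one queue per Web server and caches DNS lookups so each hostname is resolved only once. It is an assumption-laden simplification, not a real large-scale design; actual crawlers add locking, rate limiting and persistent queues.

```python
# Per-server queues (so one host is not hammered by many threads at once)
# and a DNS cache (so each hostname is resolved only once).
import socket
from collections import defaultdict, deque
from urllib.parse import urlsplit

dns_cache = {}                      # hostname -> IP address
server_queues = defaultdict(deque)  # hostname -> pending URLs for that host


def resolve(hostname):
    """Resolve a hostname, caching the answer to avoid repeated DNS lookups."""
    if hostname not in dns_cache:
        dns_cache[hostname] = socket.gethostbyname(hostname)
    return dns_cache[hostname]


def enqueue(url):
    """Place a discovered URL on the queue of the server that hosts it."""
    host = urlsplit(url).hostname
    if host:
        server_queues[host].append(url)


def next_batch():
    """Take at most one URL per server, balancing load across hosts."""
    batch = []
    for host, queue in server_queues.items():
        if queue:
            batch.append((resolve(host), queue.popleft()))
    return batch


enqueue("https://example.com/a")
enqueue("https://example.com/b")
enqueue("https://example.org/")
print(next_batch())  # one URL from each host, with its cached IP
```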

Crawling Policy

The behavior of a crawler is completely dependent on the combination of one or more policies. The policies are as follows –

  • Selection Policy
  • Re-visit Policy
  • Politeness Policy
  • Parallelization Policy

Selection Policy

Given the current size of the Web, even large search engines cover only a portion of the publicly available part. As a crawler always downloads just a fraction of the Web’s pages, it is highly desirable for the downloaded fraction to contain the most relevant pages and not just a random sample of the Web. This requires a metric of importance for prioritizing Web pages. The importance of a page is a function of its intrinsic quality, its popularity in terms of links or visits, and even its URL. Common selection policies are listed below; a small sketch of importance-based prioritization follows the list.

  • Restricting followed links
  • Path-ascending crawling
  • Focused crawling
  • Crawling the Deep Web
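As a small illustration of importance-based selection, the sketch below keeps the crawl frontier in a priority queue. The `importance` function here is a toy placeholder (shallower URLs score higher), standing in for whatever link- or visit-based metric a real crawler would use; the class and function names are hypothetical.

```python
# Prioritised crawl frontier: most "important" URL is popped first.
import heapq
from urllib.parse import urlsplit


def importance(url):
    """Toy importance metric: shallower paths score higher (illustrative only)."""
    segments = [s for s in urlsplit(url).path.split("/") if s]
    return 1.0 / (1 + len(segments))


class PriorityFrontier:
    def __init__(self):
        self._heap = []

    def add(self, url):
        # heapq pops the smallest value first, so negate the score.
        heapq.heappush(self._heap, (-importance(url), url))

    def pop(self):
        return heapq.heappop(self._heap)[1]   # URL with the highest score


frontier = PriorityFrontier()
for u in ["https://example.com/a/b/c", "https://example.com/", "https://example.com/a"]:
    frontier.add(u)
print(frontier.pop())  # https://example.com/ is crawled first
```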

Re-visit Policy

The Web has a very dynamic nature, and crawling a fraction of the Web can take weeks or months. By the time a Web crawler has finished its crawl, many events could have happened, including creations, updates, and deletions. From the search engine’s point of view, there is a cost associated with not detecting an event and thus having an outdated copy of a resource. The most commonly used cost functions are freshness and age, and the main objective of a crawler is to maintain high average freshness and low average age of web pages; a small sketch of these two cost functions follows the list below. Re-visit policies are –

  • Uniform Policy
  • Proportional Policy
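Below is a small sketch of the two cost functions, following their usual definitions: freshness is 1 while the stored copy still matches the live page and 0 once the page has changed, while age is 0 until the page changes and then grows with the time elapsed since that change. The function names and time units are illustrative.

```python
# Freshness and age of a stored page snapshot.
def freshness(local_copy_changed):
    """1 if the stored snapshot is still identical to the live page, else 0."""
    return 0 if local_copy_changed else 1


def age(now, last_live_modification, last_crawl):
    """How long the stored snapshot has been outdated, in the same time units."""
    if last_live_modification <= last_crawl:
        return 0                     # we already have the latest version
    return now - last_live_modification


# Example: page changed at t=10, we last crawled it at t=4, and it is now t=25.
print(freshness(local_copy_changed=True))                     # 0 -> copy is stale
print(age(now=25, last_live_modification=10, last_crawl=4))   # outdated for 15 time units
```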

Politeness Policy

Crawlers can retrieve data much more quickly and in greater depth than human searchers, so they can have a crippling impact on the performance of a site. Needless to say, if a single crawler is performing multiple requests per second and/or downloading large files, a server will have a hard time keeping up with requests from multiple crawlers. As noted by Koster, the use of Web crawlers is useful for a number of tasks, but comes with a price for the general community. The costs of using Web crawlers include:

  • Network resources
  • Server overload
  • Server/router crashes
  • Network and server disruption

A partial solution to these problems is the Robots Exclusion Protocol (robots.txt).
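As a small illustration, the sketch below uses Python’s standard-library `urllib.robotparser` to check whether a URL may be fetched, and sleeps between requests as a simple politeness measure. The user-agent string, delay value and example URLs are made up for illustration.

```python
# Honouring the Robots Exclusion Protocol plus a fixed delay between requests.
import time
from urllib import robotparser

USER_AGENT = "TechSimplifyBot"       # hypothetical crawler name
DEFAULT_DELAY = 2.0                  # seconds between requests to one host

rp = robotparser.RobotFileParser()
rp.set_url("https://example.com/robots.txt")
rp.read()                            # download and parse robots.txt


def polite_fetch_allowed(url):
    """Return True only if robots.txt permits this user agent to fetch url."""
    return rp.can_fetch(USER_AGENT, url)


for url in ["https://example.com/", "https://example.com/private/page"]:
    if polite_fetch_allowed(url):
        print("allowed:", url)       # the actual fetch would happen here
    else:
        print("disallowed by robots.txt:", url)
    time.sleep(DEFAULT_DELAY)        # wait before the next request
```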

Parallelization Policy

A parallel crawler is a crawler that runs multiple processes in parallel. The goal is to maximize the download rate while minimizing the overhead from parallelization and to avoid repeated downloads of the same page. To avoid downloading the same page more than once, the crawling system requires a policy for assigning the new URLs discovered during the crawling process, as the same URL can be found by two different crawling processes.
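One common way to do this assignment is to map each URL to a crawler process deterministically, for example by hashing its hostname. The sketch below illustrates the idea; the number of crawler processes and the choice of hash are arbitrary assumptions.

```python
# Deterministic URL assignment for a parallel crawler: the hash of the
# hostname decides which of the N crawling processes owns a URL, so two
# processes can never both claim the same URL.
import hashlib
from urllib.parse import urlsplit

NUM_CRAWLERS = 4                     # illustrative number of crawl processes


def assigned_crawler(url):
    """Deterministically map a URL's host to one of the crawler processes."""
    host = urlsplit(url).hostname or ""
    digest = hashlib.md5(host.encode("utf-8")).hexdigest()
    return int(digest, 16) % NUM_CRAWLERS


for u in ["https://example.com/a", "https://example.com/b", "https://example.org/"]:
    print(u, "->", "crawler", assigned_crawler(u))
# Both example.com URLs land on the same crawler; example.org may land elsewhere.
```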

Tools for Web Crawlers

Cyotek WebCopy is a free tool for copying full or partial websites locally onto your hard disk for offline viewing. WebCopy will examine the HTML mark-up of a website and attempt to discover all linked resources such as other pages, images, videos, and file downloads: anything and everything. It will download all resources, and continue to search for more. In this manner, WebCopy can “crawl” an entire website and download everything it sees in an effort to create a reasonable facsimile of the source website. WebCopy does not include a virtual DOM or any form of JavaScript parsing. If a website makes heavy use of JavaScript, it is unlikely that WebCopy will be able to make a true copy, since it cannot discover all of the website when links are generated dynamically by JavaScript. WebCopy does not download the raw source code of a website; it can only download what the HTTP server returns. While it will do its best to create an offline copy of a website, advanced data-driven websites may not work as expected once they have been copied.

Octoparse is a modern visual web data extraction tool. It provides users a point-and-click UI to develop extraction patterns, so that scrapers can apply these patterns to structured websites. Both experienced and inexperienced users find it easy to use Octoparse to bulk-extract information from websites; for most scraping tasks, no coding is needed. Octoparse, being a Windows application, is designed to harvest data from both static and dynamic websites (including those whose pages use AJAX). The software simulates human operation to interact with web pages. To make data extraction easier, Octoparse offers features such as filling out forms and entering a search term into a text box. You can run your extraction project either on your own local machine (Local Extraction) or in the cloud (Cloud Extraction). Octoparse’s cloud service, although available only in paid editions, works well for harvesting large amounts of data to meet large-scale extraction needs. The extracted data can be exported in various formats such as CSV, Excel, HTML, and TXT, or into a database (MySQL, SQL Server, and Oracle). For the pricing model click here.

Content Grabber is a powerful, multi-featured web scraping solution with web automation capabilities, developed and published by Sequentum. Content Grabber was designed from the very beginning with performance and scalability as the top priority. Multi-threading is used wherever appropriate to limit common web scraping bottlenecks such as web page retrieval. The Content Grabber agent editor has a typical point-and-click user interface where you click on the content you want to extract, or on the buttons and links you want to follow. Content Grabber is designed to manage hundreds of agents in a professional web scraping environment with development, testing and production servers. Logs, schedules and status information for all agents can be managed in one centralized location, and all proxies, database connections and script libraries can be managed on a per-server basis. Content Grabber has a fully fledged built-in script editor with IntelliSense that is more than capable when building smaller scripts. You can download a trial version, and for the pricing model visit Content Grabber.

Scrapinghub is a cloud-based data extraction tool that helps thousands of developers fetch valuable data. Its open-source visual scraping tool allows users to scrape websites without any programming knowledge. Scrapinghub uses Crawlera, a smart proxy rotator that supports bypassing bot counter-measures to crawl huge or bot-protected sites easily. It enables users to crawl from multiple IPs and locations without the pain of proxy management, through a simple HTTP API. Scrapinghub converts the entire web page into organized content. Its team of experts is available to help in case its crawl builder can’t meet your requirements. For the pricing model click here.

WebHarvy is a point-and-click web scraping software designed for non-programmers. WebHarvy can automatically scrape text, images, URLs and emails from websites, and save the scraped content in various formats. It also provides a built-in scheduler and proxy support, which enable anonymous crawling and help prevent the scraper from being blocked by web servers; you have the option to access target websites via proxy servers or a VPN. Users can save the data extracted from web pages in a variety of formats: the current version of WebHarvy Web Scraper allows you to export the scraped data as an XML, CSV, JSON or TSV file, or to an SQL database. A free trial version is available for download, and for the pricing model visit the official website.

That’s all for today. I hope you guys liked it and learned something new today. I’ll be back soon with some new technology explained in simple terms. Until then, take care 🙂
