Crawling IVC with Navicrawler & WIRE


Source: IVC (Instituto Verificador de Comunicação), the Brazilian publishing and media industry circulation institute, led principally by big advertisers, major advertising agencies, and media groups (Abril, RBS, Editora Globo, Estado de S. Paulo, Infoglobo, Diários Associados) …

The institute was known as the Instituto Verificador de Circulação until May 2015. Its services encompass …

  1. [Free publications]
  2. [Daily newspapers]
  3. [Magazines]
  4. [B2B]
  5. [Web sites]
  6. [Events (log-in required)]

The IVC Web site is extremely valuable as a «hub and authority» site: its outbound links make it possible to build an extensive, if partial, roster of the industries it covers.

First, I tried using Heritrix to explore these links, but I am still finding it difficult to set up and use. Likewise with HTTrack and wget --recursive --spider.

In the end I produced a set of nearly 600 data points using the extremely useful Navicrawler for Portable Firefox (please update this marvelous tool, mes amis!) and used that set to seed a WIRE crawl.
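For anyone repeating the exercise, here is a minimal sketch of how a list of harvested links might be tidied before it is handed to a crawler as seeds. It is not what Navicrawler or WIRE do internally. The function name build_seed_list and the normalization rules (lower-casing the host, dropping fragments, keeping only http and https links) are my own assumptions, as is the idea that the crawler wants a plain list with one URL per line.

```python
from urllib.parse import urlsplit, urlunsplit


def build_seed_list(raw_links):
    """Return de-duplicated http(s) URLs in first-seen order.

    Hypothetical helper: the normalization rules are assumptions,
    not anything prescribed by Navicrawler or WIRE.
    """
    seen = set()
    seeds = []
    for link in raw_links:
        link = link.strip()
        if not link:
            continue  # skip blank lines
        parts = urlsplit(link)
        # Keep only absolute web links; drop mailto:, javascript:, etc.
        if parts.scheme.lower() not in ("http", "https") or not parts.netloc:
            continue
        # Lower-case scheme and host, drop the fragment, default path to "/".
        normalized = urlunsplit((
            parts.scheme.lower(),
            parts.netloc.lower(),
            parts.path or "/",
            parts.query,
            "",
        ))
        if normalized not in seen:
            seen.add(normalized)
            seeds.append(normalized)
    return seeds


if __name__ == "__main__":
    # Illustrative input only; these are placeholder addresses.
    sample = [
        "http://Example.com",
        "http://example.com/#top",
        "mailto:someone@example.com",
        "  https://example.org/a?b=1 ",
    ]
    # One URL per line, ready to paste into a seed file.
    print("\n".join(build_seed_list(sample)))
```

Lower-casing the host and dropping fragments merges addresses that differ only cosmetically, so the crawler is not asked to fetch the same page twice.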