Technical Note | Focused Crawling

Although I blog on this subject more often on my Portuguese-language blog (or what passes for one), there is a method to at least some of the madness that emerges from scribbling down every brain fart that comes down the pike, that universal risk for the reader of the blogerati.

After combing through Scholar.Google a fair amount, I believe what I am trying to figure out is basically how to accomplish focused crawling — and post-crawling — in order to produce link sociologies that might mean something to a poetry-major readership.

One way to get started is to take a known or hypothesized group of institutional actors that are tightly related in terms of link sharing (first subtracting or otherwise compensating for the social network or echo machine effect where needed) and then use them to cast a relatively shallow, but broad, net to see whether your first impressions about institutional relations pan out. Do the Web strategies of institutional actors reflect their real-life configurations? What patterns are typical of SEO-optimized Wirklichkeit?

In a crawl seeded so as to measure the overlap between some U.S. foundations and their offshore colleagues and collaborators, I found that the Cato Institute, for example (above), shows up as a regular linker to an intellectually coherent (some might say narrow) pipeline of content providers and promoters.

At this point, one might run a link analysis on the raw data obtained with the WIRE Web crawler as a rough measure of centrality.
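One rough way to read "link analysis as a measure of centrality" is simply to count inbound links per site. The sketch below does only that, in Python, on a made-up list of (linking site, linked-to site) pairs. The function names and the sample domains are my own inventions for illustration, not anything WIRE produces.

```python
from collections import Counter

def in_degree(edges):
    """Count inbound links per site from (source, target) pairs.

    Self-links are skipped, since a site linking to itself says
    nothing about its standing among the other sites.
    """
    counts = Counter(target for source, target in edges if source != target)
    return dict(counts)

def top_sites(edges, n=10):
    """Return the n most linked-to sites as (site, in_degree) pairs."""
    return Counter(in_degree(edges)).most_common(n)

# Hypothetical site-level links, invented for this example.
sample_edges = [
    ("a.example", "b.example"),
    ("c.example", "b.example"),
    ("b.example", "a.example"),
    ("a.example", "a.example"),  # self-link, ignored
]

if __name__ == "__main__":
    for site, degree in top_sites(sample_edges, 2):
        print(site, degree)
```

In-degree is the crudest centrality measure there is; Pajek and Gephi offer more refined ones once the network file exists.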

The substantial showing of DEVIANTART.COM — below — in just about all crawls of this general nature is probably due to industrial-strength SEO and not to real live responsive readers and responders.

An interesting and useful thing to do sometimes, in Pajek or Gephi, is to compare the dynamics of a network cluster with and without the Facebook-Twitter effect.
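Outside Pajek or Gephi, the same with-and-without comparison can be roughed out in a few lines of Python. This is a minimal sketch, assuming the network is already a list of (source, target) domain pairs. The helper names, the choice of density as the thing to compare, and the sample data are all assumptions of mine.

```python
def without_hubs(edges, hubs=("facebook.com", "twitter.com")):
    """Drop every link that starts or ends at a hub domain or its subdomains."""
    def is_hub(site):
        return any(site == h or site.endswith("." + h) for h in hubs)
    return [(s, t) for s, t in edges if not is_hub(s) and not is_hub(t)]

def density(edges):
    """Directed density: distinct links divided by n*(n-1) possible links.

    Assumes the edge list contains no self-links.
    """
    nodes = {site for edge in edges for site in edge}
    n = len(nodes)
    return len(set(edges)) / (n * (n - 1)) if n > 1 else 0.0

# Hypothetical links, invented for this example.
sample_edges = [
    ("a.example", "facebook.com"),
    ("b.example", "www.facebook.com"),
    ("a.example", "b.example"),
    ("twitter.com", "b.example"),
]

if __name__ == "__main__":
    print("with hubs:   ", density(sample_edges))
    print("without hubs:", density(without_hubs(sample_edges)))
```

Any other network statistic could stand in for density here; the point is only to compute it twice, once on each version of the edge list.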

Plagiarism googles can also yield some interesting contextual stats.

grep and csvtool are used to format the data in usable form for Pajek and other analysis tools.

wire-info-extract -s > ALL.csv

produces full details on each of 160,000 sites for which entries were found by the crawler.

grep '\.org\.br' ALL.csv > BRORG.csv

returns 760 sites bearing the .org.br domain and writes them to a file called BRORG.csv.

csvtool col 1,30 BRORG.csv > brorg.index

(or what have you) creates a new file containing the site index number and domain name assigned by WIRE, for use in discovering link structures later.
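For anyone without csvtool at hand, the filter-then-pick-columns step can be approximated with Python's csv module. This is a sketch under assumptions: it takes columns 1 and 30 to hold the site number and domain name, as in the command above, and it tests only the domain column for the suffix instead of the whole row. The function name and sample rows are hypothetical.

```python
import csv

def site_index(rows, suffix=".org.br", id_col=1, name_col=30):
    """Yield (site id, domain) for rows whose domain ends with suffix.

    Column numbers are 1-based, as in csvtool. The default positions
    are an assumption taken from the csvtool command in the post.
    """
    for row in rows:
        if len(row) < max(id_col, name_col):
            continue  # skip short or malformed rows
        name = row[name_col - 1].strip().lower()
        if name.endswith(suffix):
            yield row[id_col - 1].strip(), name

def read_site_index(path, **kwargs):
    """Read a site CSV such as ALL.csv and return the filtered index as a list."""
    with open(path, newline="") as handle:
        return list(site_index(csv.reader(handle), **kwargs))

def _fake_row(site_id, domain, width=30):
    """Build a made-up row with the id in column 1 and the domain in column 30."""
    row = [""] * width
    row[0], row[-1] = site_id, domain
    return row

# Hypothetical rows, invented for this example.
sample_rows = [
    _fake_row("17", "exemplo.org.br"),
    _fake_row("18", "exemplo.com.br"),
    _fake_row("19", "br.example.org"),
]

if __name__ == "__main__":
    print(list(site_index(sample_rows)))
```

Matching on the end of the domain column avoids picking up rows where the string merely appears somewhere else in the line.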

As I recovered my previous configuration of this crawler, I found I had forgotten what method to use to recover the links table which, crossing site numbers with site names, produces a network file for Pajek and others to work on. The correct regular expression is

\(\<[1-9][0-9]*\>\)

which, in GNU grep or sed basic syntax, captures a standalone positive integer. I use it to remove the site link count given for each site number. This data remains available and researchable for each site indexed, using

wire-info-shell

on the command line.
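Crossing site numbers with site names to get a Pajek network file might look something like the following. It is a minimal sketch, assuming the index is a mapping from WIRE site number to domain name and the links are (source number, target number) pairs with the counts already stripped. The function name and sample values are made up, and the actual layout of the WIRE links table is not reproduced here.

```python
def to_pajek(index, links):
    """Build Pajek .net text from {site_id: name} and (src_id, dst_id) links.

    Pajek expects vertices numbered 1..n, so site ids are renumbered
    in sorted order. Links touching a site missing from the index are dropped.
    """
    ids = sorted(index)
    number = {site_id: n for n, site_id in enumerate(ids, start=1)}
    lines = [f"*Vertices {len(ids)}"]
    lines += [f'{number[i]} "{index[i]}"' for i in ids]
    lines.append("*Arcs")
    for src, dst in links:
        if src in number and dst in number:
            lines.append(f"{number[src]} {number[dst]}")
    return "\n".join(lines) + "\n"

# Hypothetical index and links, invented for this example.
sample_index = {40: "a.example", 7: "b.example"}
sample_links = [(7, 40), (40, 99)]  # 99 is not in the index, so it is dropped

if __name__ == "__main__":
    print(to_pajek(sample_index, sample_links), end="")
```

The resulting text can be saved with a .net extension and opened in Pajek; Gephi also imports this format.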

Tools used:

  1. WIRE
  2. WIRE-Nic
  3. YaCy
  4. ORA
  5. Jwire
  6. Pajek
  7. Gephi
  8. grep
  9. csvtool
  10. SocNetV
  11. Network Workbench