.None Taken

Screenshot from 2013-10-30 12:46:48

Source: Gephi | C. Brayton

What do you know? I think I actually debugged something today

I have finally recovered the entirety of my amateur social network analysis toolbox. Having changed over to a 64-bit box running Debian Wheezy, I had to make sure the tools that run under Wine — the Windows compatibility layer for Linux — still work properly. They do, they do: especially Pajek.
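A quick sanity check for a migration like this: wine --version confirms the 64-bit plumbing survived the move, and the Pajek path below is only a guess at a typical Wine prefix, not necessarily where your copy lives.

wine --version

wine "$HOME/.wine/drive_c/Program Files/Pajek/Pajek.exe"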

Screenshot from 2013-10-30 12:46:33

Gephi is performing beautifully again now, at any rate. I ordinarily deploy its useful export feature — *.net > *.csv — to aid in properly formatting Pajek network files for use with the yEd drawing tool, which I prefer to Gephi, dare I say it.

Above, the «egocentric» network of our online friend and colleague, the journalist Luis Nassif. It is not that I believe Mr. Nassif vain and vainglorious. The graph simply shows a sample of the network that touches Nassif or is touched by him, generally having to do with link reciprocity.

Screenshot from 2013-10-12 10:38:24

This interpretation reflects the social relations among various authors engaged in an ongoing, formal debate on various economic issues, and therefore meets the test of homophily — the sharing of interests.

Me and the Chilean Code

Screenshot: THEBIGFRKIING0

In my recovery work, I also ran into a frustrating snag with the Web crawler WIRE — the invention of an ultracool Chilean IT guy who calls himself Chato and who now works in Qatar.

How to describe this briefly and non-technically?

WIRE is a Web crawler programmed in C++ and configured via a separate XML file, *.conf.

The file configures the four main processes of the crawler: the seeder, the manager, the harvester and the gatherer.

This is a user-friendly arrangement. To be ready to run, it requires setting only a handful of variables, each accompanied by plain-spoken comments and counsel.
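Judging from the two commands that appear later in this post, each of those processes has a matching wire-bot-* executable (true of the manager and the seeder at least; the other two names are assumed). The loop below is my own illustration of the division of labor, not WIRE's canonical driver script.

# one crawl cycle per pass; repeat until the frontier stops growing
for pass in 1 2 3; do
  wire-bot-manager     # schedule the next batch of pages
  wire-bot-harvester   # download the scheduled batch
  wire-bot-gatherer    # parse the downloads and extract links
  wire-bot-seeder      # resolve newly discovered sites and URLs
done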

What got to be frustrating is that my intention in my current work is to continually search for and explore new Web sites and pages.

Currently, for example, I have been focusing on the Web site Arts and Letters Daily — a daily zine since back in the day.

I wanted to view it as representing a network of like minds — and to see to what degree great minds think alike.

I reasoned that its front page, laid out like a newspaper and with internal links in the hundreds, propitious to secondary echo-blogging, gives it a prestigious gatekeeper or other such brokerage role.

In my own scribbling on this topic, I call sites like ALDaily “virtual kiosks” or “digital newsstands” or what have you. They are the trading desks of civilization. I thought I might identify around 10,000 sites / URLs with varying degrees of proximity to the Big Seed.

And so I set the thing to working —

wire-bot-manager

As shown above, after a few hours it was clear that the crawler was aware of only a limited number of sites to explore — responding, that is, to the value of

.none

in the variable in question.

This .none is not neutral, as it was in another incarnation of the software.

Thus, for the Webographer, this is not all that useful, and presents an obstacle, while for the focused Web explorer the value of this variable determines whether the search is focused or diffuse.

That is to say: he who wants to crawl for breadth and social density, let him leave the value of the URL exclusion at

.none

Then, when you run

wire-bot-seeder --start seeds.txt

you can limit the TLDs you want to survey.

Seeds are the roots of any URL the robot runs across, as long as it is not instructed to confine its attention to a given domain.
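For what it is worth, my working assumption about seeds.txt is the obvious one: a plain list of starting URLs, one per line. The format is inferred from the --start option rather than from WIRE's manual, so treat it as a sketch.

# create a minimal seed list (one URL per line, format assumed)
cat > seeds.txt <<'EOF'
http://www.aldaily.com/
http://example.com/
EOF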

The Chileans built this feature in because they were, like NIC.br in Brazil, engaged in the sociometric study of the TLDs of various Latin American countries.

Screenshot: FSP Links2

I spent a lot of time with the sample.conf supplied by the Chileans and some of their Brazilian colleagues, but could not see the difference between what worked and what did not.

Finally, it hit me: there was a certain variable that could be affecting the behavior of the manager, the harvester, the gatherer and the seeder. Below, a healthy result, if what you want to do is gather a constantly increasing universe of URLs.

Screenshot from 2013-10-28 07:21:31 (correct)

As to the error of using

.none

it might actually be a fortuitous feature.

Screenshot from 2013-10-30 12:46:48

In this variable, any value other than a valid TLD will, once the crawl process is initiated, result in a focused crawl — based on its initial values and the evolution of its place in the network. The software designers have given the default value as

.none

Screenshot: THEBIGFRKIING0

Above, a correctly configured crawl, removing the depth-first blockage of files not immediately tied to the original, extra-crispy URLs we fed it. Now we enter a variety of TLDs and then set WIRE to the task of crawling according to breadth and at an escalating mass scale.

Eureka

But this might be something we actually want to do — focus a crawl on a small world and its internal connections with members of its own network.

As a tinkerer, I went in and tested that variable and found that, in fact, a minimal list of TLDs must be set as options for the process to calculate.

Screenshot from 2013-10-28 12:04:39

At the same time, I took the advice of the manual and configured a number of nameservers:

Screenshot: THEBIGFRKIING0

This means that zero sites were found other than those whose URL was resolved during the first running of the seeder.

I was leery of Google Public DNS at first, but we shall see.
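For the record, adding resolvers on a Debian box is just a matter of appending to /etc/resolv.conf. The Google Public DNS addresses below are the real ones; whether WIRE consults this file directly or goes through the system resolver is something I take on faith.

# as root; note that some network managers rewrite this file on reconnect
cat >> /etc/resolv.conf <<'EOF'
nameserver 8.8.8.8
nameserver 8.8.4.4
EOF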

Screenshot: MultiDNS

Now the manager schedules and organizes the next batch to be harvested based on calculations that happen in the gatherer and the seeder. Again, the key setting is link_new_sites > 0.
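Before a long run it is worth confirming that setting. The grep is safe enough; the sed assumes the element name and value share a single line in the XML, which may not hold for your copy of the conf.

# inspect the current value
grep -n 'link_new_sites' "$WIRE_CONF"

# flip it on, assuming a one-line <link_new_sites>0</link_new_sites> element
sed -i 's|<link_new_sites>0</link_new_sites>|<link_new_sites>1</link_new_sites>|' "$WIRE_CONF"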

But if we return to the file pointed to by your WIRE_CONF — mine is set with WIRE_CONF=/opt/wiredata/CONF.conf ; export WIRE_CONF — and replace

.none

with a list of every TLD in the world — which is not difficult to obtain — then you get the desired result.
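Not difficult at all: IANA publishes the authoritative list. The fetch below is real; the dot-prefixing and the comma-joining are my guesses at the shape the conf variable wants.

# grab IANA's TLD registry, lowercase it, prefix dots, join with commas
curl -s https://data.iana.org/TLD/tlds-alpha-by-domain.txt \
  | grep -v '^#' \
  | tr '[:upper:]' '[:lower:]' \
  | sed 's/^/./' \
  | paste -sd, -

That prints .ac,.ad,.ae and so on, ready to paste over the offending .none.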

Screenshot from 2013-10-26 12:03:47

Above, the end of a healthy wire-bot-seeder run — the proliferation of sites certified as “+” means the crawl continues a breadth-first parsing of newly found sites for the manager to schedule.

Screenshot from 2013-10-12 10:43:39

In Gephi, the content of the *.net file can be exported to *.csv, and then again, using OpenOffice, to an Excel-compliant spreadsheet.

At this point it is ready for everything the rough and ready yEd diagrammer can throw at it.
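For those who would rather skip the spreadsheet round trip, the same edge list can be scraped straight out of the Pajek file. A rough sketch, assuming the usual *Arcs / *Edges section layout and unquoted numeric vertex references:

# print source,target pairs from the arcs/edges sections of a .net file
awk 'tolower($1) ~ /^\*(arcs|edges)$/ { in_list = 1; next }
     /^\*/ { in_list = 0 }
     in_list && NF >= 2 { print $1 "," $2 }' network.net > edges.csv

The resulting edges.csv holds source,target pairs keyed by Pajek vertex number, which OpenOffice will happily turn into the Excel-compliant sheet described above.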

Gephi is more scientifically sophisticated by several miles — I have nightmares about the Scottish industrial cartel example — but I find that yEd can perform a lot of the same analysis in a user-friendly fashion.

Screenshot from 2013-10-30 15:12:32

So, do not answer .none as the value of the TLD variable when asked to set the scenario for a broadscope crawl.

A very large sample of a membership list of the World Association of Newspapers

Screenshot: theprob

Above, no man is .none.

Screenshot: infoshellprogresswatch

The crawl status and the linked environment of specific sites can be monitored with

wire-info-shell