Source: Gephi | C. Brayton
What do you know? I think I actually debugged something today
I have finally recovered the entirety of my amateur social network analysis toolbox. Having changed over to a 64-bit box running Debian Wheezy, I had to make sure that the tools I run under Wine, the Windows compatibility layer for Linux, still work properly. They do, they do: especially Pajek.
Gephi is performing beautifully again now, at any rate. I ordinarily use its handy export feature (*.net > *.csv) to help format Pajek network files properly for use with the yEd drawing tool, which I prefer to Gephi, dare I say it.
Above, the «egocentric» network of our online friend and colleague, journalist Luis Nassif. It is not that I believe Mr. Nassif vain and vainglorious. It simply shows a sample of the network which touches Nassif or is touched by him, generally having to do with link reciprocity.
This interpretation reflects the social relations among various authors engaged in ongoing, formal debate on various economic issues, and therefore meets the test of homophily: the sharing of interests.
Me and the Chilean Code
In my recovery work, I also ran into a frustrating snag with the Web crawler WIRE, the invention of an ultracool Chilean IT guy who calls himself Chato and who now works in Qatar.
How to describe this briefly and non-technically?
WIRE is a Web crawler programmed in C and configured in a separate XML file called *.conf.
Configured are the four main processes of the crawler: Seeder, Manager, Harvester, Gatherer.
This is a user-friendly arrangement. It requires the setting of only a handful of variables, with plain-spoken comments and counsels, to be ready.
What got to be frustrating is that my intention with my current work is to continually search for and explore new Web sites and pages.
Currently, for example, I have been focusing on the Web site Arts and Letters Daily — a daily zine since back in the day.
I wanted to view it as representing a network of like minds — and to what degree great minds think alike.
I reasoned that its front page, laid out like a newspaper and with internal links in the hundreds, propitious to secondary echo-blogging, gives it a prestigious gatekeeper or other such brokerage role.
In my own scribbling on this topic, I call sites like ALDaily “virtual kiosks” or “digital newsstands” or what have you. They are the trading desk of civilization. I thought I might identify around 10,000 sites / URLs with varying degrees of proximity to the Big Seed.
And so I set the thing to working —
As shown above, after a few hours, it was clear that the crawler was only aware of a limited number of sites to explore — above, responding to the value of
in the variable
This .none is not neutral, as it was in another incarnation of the software.
Thus, for the Webographer, this is not all that useful, and presents an obstacle, while to the focused Web explorer its value is as the basis of a focused or a diffuse search.
That is to say, he who wants to crawl for breadth and social density, let them leave the value or URL exclusion at
Then if you want to
wire-bot-seeder --start seeds.txt
you can limit the TLDs you want to survey.
Seeds are the roots of any URL the robot runs across, as long as it is not instructed to confine its attention to a given domain.
The Chileans built this feature in because they were, like Nic-Brazil, engaged in the sociometric study of TLDs in various Latin American countries.
I spent a lot of time with the sample.conf supplied by the Chileans and some of their Brazilian colleagues but could not see the difference between what worked and what did not.
Finally, it hit me: there was a certain variable that could be affecting the behavior of the manager, harvester, gatherer and seeder. Below, a healthy result, if what you want to do is gather a constantly increasing universe of URLs.
As to the error of using
it might actually be a fortuitous feature.
In this variable, any value other than a valid TLD after initiating the crawl process will result in a focused crawl — based on its initial values and the evolution of its place in the network. The software designers have given the default value as
Above, a correctly configured crawl, removing the depth-first blockage of files not immediately tied to the original, extra crispy URLs we fed it. Now we enter a variety of TLDs and then set WIRE to the task of crawling according to breadth and an escalating mass scale.
But this might be something we actually want to do — focus a crawl on a small world and its internal connections with members of its own network.
As a tinkerer, I went in and tested that variable and found that, in fact, a minimal list of TLDs must be set as options for the process to calculate.
At the same time, I take the advice of the manual and install a number of nameservers:
This means that zero sites were found other than those whose URL was resolved during the first running of the seeder.
I was leery of Google Public DNS at first, but we shall see.
Now the manager schedules and organizes the next batch to be harvested based on calculations that happen in the gatherer and the seeder. Again, the key setting is link_new_sites > 0.
But if we return to the file pointed to by your WIRE_CONF (mine is /opt/wiredata/CONF.conf ; export WIRE_CONF) and replace
with a list of every TLD in the world — which is not difficult to obtain — then you obtain the desired result.
Above, the end of a healthy wire-bot-seeder run — the proliferation of sites certified as “+” means the crawl continues a breadth-first parsing of newly found sites for the manager to schedule.
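For the list of every TLD mentioned above, here is a minimal sketch of how one might turn a downloaded listing into dotted suffixes. It assumes a plain-text listing with one TLD per line and "#" comment lines, which is how IANA publishes its list. The function name tld_suffixes is my own invention, and joining the suffixes with spaces is only a guess at what the WIRE configuration expects, so check the comments in sample.conf before pasting anything in.

```python
def tld_suffixes(listing):
    """Turn a one-TLD-per-line listing into lowercase dotted suffixes."""
    suffixes = []
    for raw in listing.splitlines():
        line = raw.strip()
        # Skip blank lines and "#" comment lines such as a version header.
        if not line or line.startswith("#"):
            continue
        suffixes.append("." + line.lower())
    return suffixes


# Hypothetical three-line listing, just to show the shape of the input.
sample_listing = "# example header\nCL\nBR\nCOM\n"

# Assumption: a space-separated string; adjust to whatever your conf wants.
joined = " ".join(tld_suffixes(sample_listing))
```

With the sample above, joined comes out as a single string of dotted suffixes that can be pasted into the configuration by hand.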
In Gephi, the content of the *.net file can be exported to *.csv, and then again, using OpenOffice, to an Excel-compliant spreadsheet.
At this point it is ready for everything the rough and ready yEd diagrammer can throw at it.
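For anyone who would rather script the *.net to *.csv step than click through it, here is a rough sketch in Python. It is not what Gephi does internally. It assumes a simple Pajek file with a *Vertices section of numbered, quoted labels followed by *Arcs or *Edges lines of the form "source target [weight]". The function names and the Source/Target/Weight column headers are my own choices, and it will stumble on unquoted labels that contain apostrophes.

```python
import csv
import io
import shlex


def pajek_to_edge_rows(net_text):
    """Return (source_label, target_label, weight) rows from Pajek .net text."""
    labels = {}
    rows = []
    section = None
    for raw in net_text.splitlines():
        line = raw.strip()
        if not line or line.startswith("%"):  # "%" starts a Pajek comment
            continue
        if line.startswith("*"):  # section marker, e.g. *Vertices 3
            section = line.split()[0].lower()
            continue
        parts = shlex.split(line)  # keeps "quoted labels" together
        if section == "*vertices":
            labels[parts[0]] = parts[1] if len(parts) > 1 else parts[0]
        elif section in ("*arcs", "*edges") and len(parts) >= 2:
            weight = parts[2] if len(parts) > 2 else "1"  # default weight
            rows.append((labels.get(parts[0], parts[0]),
                         labels.get(parts[1], parts[1]),
                         weight))
    return rows


def rows_to_csv(rows):
    """Render edge rows as CSV text with a header line."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    writer.writerow(["Source", "Target", "Weight"])
    writer.writerows(rows)
    return buffer.getvalue()
```

Feed it the text of a *.net file, write the result of rows_to_csv to disk, and the spreadsheet step can proceed as before.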
Gephi is more scientifically sophisticated by several miles (I have nightmares about the Scottish industrial cartel example), but I find that yEd can perform a lot of the same analysis in a user-friendly fashion.
So, do not answer .none as the value of .us when asked to set the scenario for a broad-scope crawl.
Above, no man is .none.
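A quick way to check whether a stray .none is still lurking in the configuration is a plain text search of the file that WIRE_CONF points to. The sketch below does only that. It deliberately makes no assumptions about the XML structure or the variable names in the file, and the function names find_none_lines and check_wire_conf are my own.

```python
import os


def find_none_lines(conf_text, needle=".none"):
    """Return (line_number, stripped_line) for each line containing needle."""
    hits = []
    for number, line in enumerate(conf_text.splitlines(), start=1):
        if needle in line:
            hits.append((number, line.strip()))
    return hits


def check_wire_conf():
    """Search the file named by the WIRE_CONF environment variable."""
    path = os.environ.get("WIRE_CONF")
    if not path:
        return []  # WIRE_CONF is not exported in this shell
    with open(path, encoding="utf-8", errors="replace") as handle:
        return find_none_lines(handle.read())
```

Calling check_wire_conf() after exporting WIRE_CONF returns the offending lines, if any, so you know where to look before starting the seeder.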
The crawl status and the linked environment of specific sites can be monitored with