OS X - Tor Web Crawler Project
OS X: curl .onion sites - how-to guide - Tor Web Crawler Project
gATO hAs been looking into mapping the Tor .onion network, crawling it from a to z and from 2 to 7 across all 16 characters of an address. I use OS X for most of my work and I wanted to curl an .onion site and check it out. As I dug around I found that if I just check my Vidalia.app it will show me where everything is located. Then the fun begins.
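Before crawling anything it helps to throw out strings that cannot be an address at all. The sketch below assumes the 16-character addresses used in this post, made of the base32 letters a-z and digits 2-7 (so 32^16 = 2^80 possible names, which is why a blind a-to-z sweep is not realistic). The function name is_onion_v2 is a made-up helper, not part of Tor or curl.

```shell
# Succeeds if $1 looks like a 16-character .onion hostname (base32: a-z, 2-7).
# is_onion_v2 is a hypothetical helper name used only for this sketch.
is_onion_v2() {
    printf '%s\n' "$1" | grep -Eq '^[a-z2-7]{16}\.onion$'
}

# Example use in a script:
#   is_onion_v2 utup22qsb6ebeejs.onion && echo "looks like an onion host"
```

This only checks the shape of the name; it says nothing about whether a service is actually reachable behind it.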
Find your TorBrowser_en-US-6.app, then click it and look at the file Info, then go to TorBrowser_en-US-6.app/Contents/MacOS/:
cd TorBrowser_en-US-6.app/Contents/MacOS/
Once here, this will show you the files:
ls -fGo
total 5976
drwxr-xr-x 7 richardamores 238 Jun 8 07:11 .
drwxr-xr-x 7 richardamores 238 Feb 19 06:54 ..
drwxr-xr-x 3 richardamores 102 Feb 19 06:54 Firefox.app
-rwxr-xr-x 1 richardamores 3045488 Feb 19 06:54 tor
-rwxr-xr-x 1 richardamores 1362 Feb 19 06:54 TorBrowserBundle
drwxr-xr-x 4 richardamores 136 Feb 19 06:54 Vidalia.app
-rw-r--r-- 1 richardamores 6435 Jun 8 07:11 VidaliaLog-06.08.2012.txt
Now I fire up the tor application: ./tor
Next, open another Terminal window and check that the Tor port is open and LISTENing on port 9050:
netstat -ant | grep 9050 # verify Tor is running
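The netstat check can be scripted so a crawler waits for Tor instead of failing on the first request. This is a rough sketch: port_is_listening and wait_for_tor are names invented here, and the 30 tries with a 2-second pause are arbitrary assumptions. The pattern accepts both the OS X address form (127.0.0.1.9050) and the Linux form (127.0.0.1:9050).

```shell
# Succeeds if netstat-style text on stdin shows a listener on port $1.
# Matches "127.0.0.1.9050" (OS X) as well as "127.0.0.1:9050" (Linux).
port_is_listening() {
    grep -E "[.:]$1[[:space:]].*LISTEN" >/dev/null
}

# Poll netstat until the SOCKS port is up; give up after 30 tries (assumed limit).
wait_for_tor() {
    port=${1:-9050}
    tries=0
    until netstat -ant | port_is_listening "$port"; do
        tries=$((tries + 1))
        if [ "$tries" -ge 30 ]; then
            return 1
        fi
        sleep 2
    done
}

# Example use: wait_for_tor 9050 || echo "Tor never came up" >&2
```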
Once you can see port 9050 in LISTEN, you're ready to use curl:
curl -iv --socks4a 127.0.0.1:9050 http://utup22qsb6ebeejs.onion/
curl -iv --socks4a 127.0.0.1:9050 http://nwycvryrozllb42g.onion
curl -iv --socks4a 127.0.0.1:9050 http://2qd7fja6e772o7yc.onion/
curl -iv --socks4a 127.0.0.1:9050 http://5onwnspjvuk7cwvk.onion/
curl -iv --socks4a 127.0.0.1:9050 http://6sgjmi53igmg7fm7.onion/
curl -iv --socks4a 127.0.0.1:9050 http://6vmgggba6rksjyim.onion/
Those are a few sites that you can check out. curl is just one of those tools that keeps on giving, and of course if I can get one app to work through Tor on OS X, then I can get other apps to use Tor as a proxy for all my command-line work. Time to have some fun. gATO oUt
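The one-site-at-a-time commands above could be wrapped in a small loop that reads addresses from a list. This is only a sketch: fetch_via_tor and crawl_list are hypothetical names, the CURL and TOR_SOCKS variables are conveniences invented here, and the 60-second connect timeout is a guess at what a slow hidden service needs.

```shell
# CURL can be overridden (for example CURL=echo) to preview the command line.
CURL=${CURL:-curl}
# Tor's local SOCKS port, as checked with netstat earlier.
TOR_SOCKS=${TOR_SOCKS:-127.0.0.1:9050}

# Fetch one URL through Tor; with --socks4a the proxy resolves the hostname.
fetch_via_tor() {
    "$CURL" -s -i --socks4a "$TOR_SOCKS" --connect-timeout 60 "$1"
}

# Read URLs from stdin, one per line, and fetch each in turn.
# Blank lines and lines starting with '#' are skipped.
crawl_list() {
    while IFS= read -r url; do
        case $url in
            ''|'#'*) continue ;;
        esac
        echo "== $url"
        fetch_via_tor "$url" </dev/null || echo "   failed: $url" >&2
    done
}

# Example use: crawl_list < onion-sites.txt > crawl.log
```

Running it as CURL=echo crawl_list < onion-sites.txt prints the curl arguments without touching the network, which is handy for checking the list first.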
Lab Notes
- sudo apt-get install tor
- sudo /etc/init.d/tor start
- netstat -ant | grep 9050 # verify Tor is running
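Here is a good crawler to play with:
<?php
// First attempt: treats port 9050 as an HTTP(S) proxy. Tor's 9050 is a SOCKS port,
// so this variant is expected to fail; the SOCKS5 version follows.
$ch = curl_init('http://google.com');
curl_setopt($ch, CURLOPT_HEADER, 1);
curl_setopt($ch, CURLOPT_HTTPPROXYTUNNEL, 1);
curl_setopt($ch, CURLOPT_PROXY, 'https://127.0.0.1:9050/');
curl_exec($ch);
curl_close($ch);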
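<?php
$ch = curl_init('http://google.com');
curl_setopt($ch, CURLOPT_HEADER, 1);
curl_setopt($ch, CURLOPT_HTTPPROXYTUNNEL, 1);
// SOCKS5 proxy on Tor's local port
curl_setopt($ch, CURLOPT_PROXY, "localhost:9050");
// For .onion hosts the proxy has to resolve the name; CURLPROXY_SOCKS5_HOSTNAME
// does that where your PHP build defines it.
curl_setopt($ch, CURLOPT_PROXYTYPE, CURLPROXY_SOCKS5);
curl_exec($ch);
curl_close($ch);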
Tor Web Crawler
http://stackoverflow.com/questions/9237477/tor-web-crawler
This did not work (netstat shows it on socks4, not socks5):
curl -s --socks5-local 127.0.0.1:9050 --user-agent "Mozilla/5.0 (Windows; U; Windows NT 6.1; en-US; rv:1.9.2.3) Gecko/20100401 Firefox/3.6.3" -I http://utup22qsb6ebeejs.onion/
Turn on Tor
Run /Users/gatomalo/Downloads/TorBrowser_en-US-6.app/Contents/MacOS/tor:
cd /Users/gatomalo/Downloads/TorBrowser_en-US-6.app/Contents/MacOS
./tor
Now check that the proxy is running on port 9050:
netstat -ant | grep 9050
Now run your network commands through SOCKS port 9050.
=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-
curl -S --socks5-hostname 127.0.0.1:9050 -I http://utup22qsb6ebeejs.onion/
HTTP/1.1 200 OK
Date: Thu, 12 Jul 2012 17:49:49 GMT
Server: Apache/2.2.22 (Ubuntu)
X-Powered-By: PHP/5.3.10-1ubuntu3.2
Set-Cookie: fpsess_fp-a350e65d=8hg0upuuhcpuf4pgvg45l9c2b2; path=/
Expires: Thu, 19 Nov 1981 08:52:00 GMT
Cache-Control: no-store, no-cache, must-revalidate, post-check=0, pre-check=0
Pragma: no-cache
Vary: Accept-Encoding
Transfer-Encoding: chunked
Content-Type: text/html; charset=utf-8
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Strict//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd">
<html xmlns="http://www.w3.org/1999/xhtml">
<head>
<title>My Hidden Blog</title>
<meta http-equiv="Content-Type" content="text/html; charset=utf-8" />
<!-- start of jsUtils -->
<script type="text/javascript" src="http://utup22qsb6ebeejs.onion/fp-plugins/jquery/res/jquery-1.4.2.min.js"></script>
<script type="text/javascript" src="http://utup22qsb6ebeejs.onion/fp-plugins/jquery/res/jquery-ui-1.8.2.custom.min.js"></script>
<!-- end of jsUtils -->
<!-- FP STD HEADER -->
<meta name="generator" content="FlatPress fp-0.1010.1" />
<link rel="alternate" type="application/rss+xml" title="Get RSS 2.0 Feed" href="http://utup22qsb6ebeejs.onion/?x=feed:rss2" />
<link rel="alternate" type="application/atom+xml" title="Get Atom 1.0 Feed" href="http://utup22qsb6ebeejs.onion/?x=feed:atom" />
<!-- EOF FP STD HEADER -->
<!-- FP STD STYLESHEET -->
<link media="screen,projection,handheld" href="http://utup22qsb6ebeejs.onion/fp-interface/themes/leggero/leggero/res/style.css" type="text/css" rel="stylesheet" /><link media="print" href="http://utup22qsb6ebeejs.onion/fp-interface/themes/leggero/leggero/res/print.css" type="text/css" rel="stylesheet" />
<!-- FP STD STYLESHEET -->
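A page like the one above is also where a crawler finds its next hops. As a rough sketch, the filter below pulls every distinct 16-character .onion hostname out of whatever HTML is piped into it. extract_onion_hosts is a made-up name, and the pattern is a plain text match rather than a real HTML parser, so treat its output as candidates only.

```shell
# Print each distinct 16-character .onion hostname found in the text on stdin.
# extract_onion_hosts is a hypothetical helper; it does not parse HTML properly.
extract_onion_hosts() {
    grep -oE '[a-z2-7]{16}\.onion' | sort -u
}

# Example use (page.html saved earlier with curl -o):
#   extract_onion_hosts < page.html >> to-visit.txt
```

On the page shown above this would yield a single line, since every link points back at the same hidden blog.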
Some other curl switches =-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-
--connect-timeout <seconds>
Maximum time in seconds that you allow the connection to the server to take. This only limits the connection phase, once curl has connected this option is of no more use. See also the -m/--max-time option.
If this option is used several times, the last one will be used.
-D/--dump-header <file>
Write the protocol headers to the specified file.
This option is handy to use when you want to store the headers that a HTTP site sends to you. Cookies from the headers could then be read in a second curl invocation by using the -b/--cookie option! The -c/--cookie-jar option is however a better way to store cookies.
When used in FTP, the FTP server response lines are considered being "headers" and thus are saved there.
If this option is used several times, the last one will be used.
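To make the -D description concrete, here is a small sketch that reads a header dump and prints the cookies the site tried to set. The function name cookies_from_headers and the file name headers.txt are assumptions for illustration; for real cookie handling the -c/--cookie-jar option mentioned above is the safer route.

```shell
# Print "name=value" for every Set-Cookie line in a header dump written by -D.
# cookies_from_headers is a hypothetical helper name.
cookies_from_headers() {
    # Strip the carriage returns HTTP headers carry, then keep the part before ';'.
    tr -d '\r' < "$1" |
        sed -n 's/^[Ss]et-[Cc]ookie:[[:space:]]*\([^;]*\).*/\1/p'
}

# Example use, assuming Tor is listening on 9050:
#   curl -s --socks5-hostname 127.0.0.1:9050 -D headers.txt -o page.html http://utup22qsb6ebeejs.onion/
#   cookies_from_headers headers.txt
```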
-f/--fail
(HTTP) Fail silently (no output at all) on server errors. This is mostly done to better enable scripts etc to better deal with failed attempts. In normal cases when a HTTP server fails to deliver a document, it returns an HTML document stating so (which often also describes why and more). This flag will prevent curl from outputting that and return error 22.
This method is not fail-safe and there are occasions where non-successful response codes will slip through, especially when authentication is involved (response codes 401 and 407).
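Since -f turns server errors into exit code 22, a crawl script can branch on curl's exit status instead of parsing error pages. The sketch below maps a few well-known curl exit codes to short messages; describe_curl_exit is an invented name, the wording of the messages is mine, and the list is deliberately incomplete.

```shell
# Translate a handful of curl exit codes into short log messages.
# describe_curl_exit is a hypothetical helper; see the curl manual for the full list.
describe_curl_exit() {
    case $1 in
        0)  echo "ok" ;;
        6)  echo "could not resolve host" ;;
        7)  echo "could not connect (is Tor running?)" ;;
        22) echo "HTTP error reported by --fail" ;;
        28) echo "timed out" ;;
        *)  echo "curl exit code $1" ;;
    esac
}

# Example use, assuming Tor is listening on 9050:
#   curl -sf --socks5-hostname 127.0.0.1:9050 -o page.html http://utup22qsb6ebeejs.onion/
#   describe_curl_exit $?
```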
--ssl
(FTP, POP3, IMAP, SMTP) Try to use SSL/TLS for the connection. Reverts to a non-secure connection if the server doesn't support SSL/TLS. See also --ftp-ssl-control and --ssl-reqd for different levels of encryption required. (Added in 7.20.0)
This option was formerly known as --ftp-ssl (Added in 7.11.0) and that can still be used but will be removed in a future version.
-H/--header <header>
(HTTP) Extra header to use when getting a web page. You may specify any number of extra headers. Note that if you should add a custom header that has the same name as one of the internal ones curl would use, your externally set header will be used instead of the internal one. This allows you to make even trickier stuff than curl would normally do. You should not replace internally set headers without knowing perfectly well what you're doing. Remove an internal header by giving a replacement without content on the right side of the colon, as in: -H "Host:".
curl will make sure that each header you add/replace is sent with the proper end-of-line marker, you should thus not add that as a part of the header content: do not add newlines or carriage returns, they will only mess things up for you.
See also the -A/--user-agent and -e/--referer options.
This option can be used multiple times to add/replace/remove multiple headers.
-o/--output <file>
Write output to <file> instead of stdout. If you are using {} or [] to fetch multiple documents, you can use '#' followed by a number in the <file> specifier. That variable will be replaced with the current string for the URL being fetched. Like in:
curl http://{one,two}.site.com -o "file_#1.txt"
or use several variables like:
curl http://{site,host}.host[1-5].com -o "#1_#2"
You may use this option as many times as the number of URLs you have.
See also the --create-dirs option to create the local directories dynamically. Specifying the output as '-' (a single dash) will force the output to be done to stdout.
-r/--range <range>
(HTTP/FTP/SFTP/FILE) Retrieve a byte range (i.e. a partial document) from a HTTP/1.1, FTP or SFTP server or a local FILE. Ranges can be specified in a number of ways.
0-499 specifies the first 500 bytes
500-999 specifies the second 500 bytes
-500 specifies the last 500 bytes
9500- specifies the bytes from offset 9500 and forward
0-0,-1 specifies the first and last byte only(*)(H)
500-700,600-799 specifies 300 bytes from offset 500(H)
100-199,500-599 specifies two separate 100-byte ranges(*)(H)
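The range forms above can be generated rather than typed, which is useful when pulling a large file over a slow circuit in pieces. This is a sketch under assumptions: byte_ranges is a name invented here, and it presumes you already know the total size (for example from a Content-Length header).

```shell
# Print consecutive -r style ranges covering $1 bytes in chunks of $2 bytes.
# byte_ranges is a hypothetical helper name.
byte_ranges() {
    total=$1
    size=$2
    # Refuse a zero or negative chunk size, which would never make progress.
    [ "$size" -gt 0 ] || return 1
    start=0
    while [ "$start" -lt "$total" ]; do
        end=$((start + size - 1))
        if [ "$end" -ge "$total" ]; then
            end=$((total - 1))
        fi
        echo "$start-$end"
        start=$((end + 1))
    done
}

# Example use: byte_ranges 1200 500 prints 0-499, 500-999 and 1000-1199,
# each of which can be passed to curl as -r <range>.
```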
-v/--verbose
Makes the fetching more verbose/talkative. Mostly useful for debugging. A line starting with '>' means "header data" sent by curl, '<' means "header data" received by curl that is hidden in normal cases, and a line starting with '*' means additional info provided by curl.
Note that if you only want HTTP headers in the output, -i/--include might be the option you're looking for.
If you think this option still doesn't give you enough details, consider using --trace or --trace-ascii instead.
This option overrides previous uses of --trace-ascii or --trace.
Use -s/--silent to make curl quiet.
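Because -v marks each line with '>', '<' or '*', a saved verbose log is easy to split after the fact. The filter below is a minimal sketch that keeps only the headers the server sent back; received_headers is a made-up name, and it assumes the log was captured from curl's stderr (for example with 2> verbose.log).

```shell
# Keep only the received header lines ('<' prefix) from a curl -v log on stdin.
# received_headers is a hypothetical helper name.
received_headers() {
    # Drop carriage returns, then print '<' lines with the marker removed.
    tr -d '\r' | sed -n 's/^< //p'
}

# Example use, assuming Tor is listening on 9050:
#   curl -v --socks5-hostname 127.0.0.1:9050 -o page.html http://utup22qsb6ebeejs.onion/ 2> verbose.log
#   received_headers < verbose.log
```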








