Waseda Univ. and Baidu Collaborative Web Crawler
Yamana Lab., Computer Science and Engineering Div., Waseda Univ.
Baidu, Inc.
(Oct. 2008)
If you have any problems or questions, please send us an e-mail.
When you send us an e-mail, please include your Web server's URL.
IP address used by our W_Univ_BJ_spider:
    119.63.193.209
About our Project
Our research project is part of a collaborative research effort between Waseda
University and Baidu, Inc. Our crawler gathers Web pages, mainly those served
from Japan, in order to analyze their contents for our research. A part of the
gathered Web pages will be indexed by the Baidu search engine.
About our Web Crawler
- The gathering frequency for each Web site is automatically adjusted so as
not to put a heavy load on the Web server. You may specify the minimum time
interval between accesses to your Web site in the file /robots.txt (see the
"crawl-delay" example below).
- The crawler/spider name is "W_Univ_BJ_spider(http://www.yama.info.waseda.ac.jp/~yamana/WBJ/)"
- First, our crawler sends an HTTP HEAD request to your Web server and checks
whether the page has been updated by analyzing the response headers. If the
page has been updated, our crawler gathers it by sending a GET request (see
the Python sketch after this list).
- If you want to deny access from our crawler, please put /robots.txt on your
Web server, or include a robots META tag (for example, <meta name="robots"
content="noindex, nofollow">) in any Web page that should not be indexed or
whose links should not be followed.
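
For illustration, here is a minimal Python sketch of the HEAD-then-GET pattern
described above. The fetch_if_updated helper and the Last-Modified comparison
are assumptions for this example; the crawler's actual update check is not
published.

import urllib.request

def fetch_if_updated(url, last_seen_modified):
    # Send a HEAD request first and read the Last-Modified header.
    head = urllib.request.Request(url, method="HEAD")
    with urllib.request.urlopen(head) as resp:
        last_modified = resp.headers.get("Last-Modified")
    # Send a GET request only when the page looks new or updated.
    if last_modified is None or last_modified != last_seen_modified:
        with urllib.request.urlopen(url) as resp:
            return resp.read(), last_modified
    return None, last_seen_modified  # unchanged: skip the download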
How to deny access from our Crawler
- Our crawler decides whether it may access your Web pages based on "A
Standard for Robot Exclusion" (http://www.robotstxt.org/orig.html), by
analyzing /robots.txt and META tags.
- When you want to deny all access to your Web server, please include the
following in /robots.txt (a parsing check follows the example).

User-Agent: W_Univ_BJ_spider
Disallow: /
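
Rules of this basic form can be checked with Python's standard
urllib.robotparser module. This is only an illustration, not the crawler's own
parser:

import urllib.robotparser

rules = """User-Agent: W_Univ_BJ_spider
Disallow: /
"""

rp = urllib.robotparser.RobotFileParser()
rp.parse(rules.splitlines())
# Every URL on the host is now off limits for this user agent.
print(rp.can_fetch("W_Univ_BJ_spider", "http://example.com/page.html"))  # False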
- When you want to deny a specific file type (e.g. pdf), please include the
following in /robots.txt.

User-Agent: W_Univ_BJ_spider
Disallow: /*.pdf$

"*" matches any sequence of characters.
"$" matches the end of the URL.
In the above example, if "$" is omitted, every URL containing the specified
characters, such as "abc.pdff", will also be matched, as the sketch below
illustrates.
- When you want to specify the time interval at which our crawler accesses
your Web site, please use the "crawl-delay" option. The interval is given in
seconds. The following example specifies a 10-second access interval.

User-Agent: W_Univ_BJ_spider
Crawl-delay: 10
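
For reference, urllib.robotparser can also read this option; the snippet below
is an illustrative check, not the crawler's own code.

import urllib.robotparser

rules = """User-Agent: W_Univ_BJ_spider
Crawl-delay: 10
"""

rp = urllib.robotparser.RobotFileParser()
rp.parse(rules.splitlines())
# A polite client would sleep this many seconds between requests.
print(rp.crawl_delay("W_Univ_BJ_spider"))  # 10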
- robots.txt is cached by our crawler for at least 24 hours, and for about 72
hours on average; the exact lifetime depends on the gathering frequency. A toy
cache illustrating this idea follows.
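
To illustrate the idea, here is a toy TTL cache showing how a crawler might
reuse a fetched robots.txt until it expires. RobotsCache and the 24-hour
default TTL are assumptions for this example, not the actual implementation.

import time

class RobotsCache:
    def __init__(self, ttl_seconds=24 * 3600):
        self.ttl = ttl_seconds
        self._store = {}  # host -> (fetched_at, robots_txt_text)

    def get(self, host, fetch):
        # Re-fetch via the caller-supplied fetch(host) callable only
        # after the cached copy is older than the TTL.
        now = time.time()
        entry = self._store.get(host)
        if entry is None or now - entry[0] > self.ttl:
            self._store[host] = (now, fetch(host))
        return self._store[host][1]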