Waseda Univ. and Baidu Collaborative Web Crawler
Yamana Lab., Computer Science and Engineering Div., Waseda Univ.
Baidu, Inc.
(Oct. 2008)
If you have any problems or questions, please send us an e-mail.
When you send us an e-mail, please include your Web server's URL.
IP address used by our W_Univ_BJ_spider:
    119.63.193.209
About our Project
Our research project is part of a collaborative research effort between Waseda
University and Baidu, Inc. Our crawler gathers Web pages, mainly those served
from Japan, in order to analyze their contents for our research. A part of the
gathered Web pages will be indexed by the Baidu search engine.
About our Web Crawler
- The gathering frequency for each Web site is automatically adjusted so as
not to put a heavy load on the Web server. You may specify the minimum time
interval between accesses to your Web site in the file /robots.txt (see the
"crawl-delay" example below).
- The crawler/spider name is "W_Univ_BJ_spider(http://www.yama.info.waseda.ac.jp/~yamana/WBJ/)"
- First, our crawler sends an HTTP HEAD request to your Web server and checks
whether the page has been updated by analyzing the response headers. If the
page has been updated, our crawler gathers it by sending a GET request (see
the Python sketch after this list).
- If you want to deny access from our crawler, please put /robots.txt on your
Web server, or include a robots META tag (for example, <meta name="robots"
content="noindex, nofollow">) in any Web page that should not be indexed or
whose links should not be followed.
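
For illustration, here is a minimal Python sketch of the HEAD-then-GET pattern
described above. The fetch_if_updated helper and the Last-Modified comparison
are assumptions for this example; the crawler's actual update check is not
published.

import urllib.request

def fetch_if_updated(url, last_seen_modified):
    # Send a HEAD request first and read the Last-Modified header.
    head = urllib.request.Request(url, method="HEAD")
    with urllib.request.urlopen(head) as resp:
        last_modified = resp.headers.get("Last-Modified")
    # Send a GET request only when the page looks new or updated.
    if last_modified is None or last_modified != last_seen_modified:
        with urllib.request.urlopen(url) as resp:
            return resp.read(), last_modified
    return None, last_seen_modified  # unchanged: skip the download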
How to deny access from our Crawler
- Our crawler decides whether it may access your Web pages based on "A
Standard for Robot Exclusion" (http://www.robotstxt.org/orig.html), by
analyzing /robots.txt and META tags.
- When you want to deny all access to your Web server, please include the
following in /robots.txt (a parsing check follows the example).

User-Agent: W_Univ_BJ_spider
Disallow: /
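
Rules of this basic form can be checked with Python's standard
urllib.robotparser module. This is only an illustration, not the crawler's own
parser:

import urllib.robotparser

rules = """User-Agent: W_Univ_BJ_spider
Disallow: /
"""

rp = urllib.robotparser.RobotFileParser()
rp.parse(rules.splitlines())
# Every URL on the host is now off limits for this user agent.
print(rp.can_fetch("W_Univ_BJ_spider", "http://example.com/page.html"))  # False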
- When you want to deny a specific file type (e.g. pdf), please include the
following in /robots.txt.

User-Agent: W_Univ_BJ_spider
Disallow: /*.pdf$

"*" matches any sequence of characters.
"$" matches the end of the URL.
In the above example, if "$" is omitted, every URL containing the specified
characters, such as "abc.pdff", will also be matched, as the sketch below
illustrates.
- When you want to specify the time interval at which our crawler accesses
your Web site, please use the "crawl-delay" option. The interval is given in
seconds. The following example specifies a 10-second access interval.

User-Agent: W_Univ_BJ_spider
Crawl-delay: 10
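
For reference, urllib.robotparser can also read this option; the snippet below
is an illustrative check, not the crawler's own code.

import urllib.robotparser

rules = """User-Agent: W_Univ_BJ_spider
Crawl-delay: 10
"""

rp = urllib.robotparser.RobotFileParser()
rp.parse(rules.splitlines())
# A polite client would sleep this many seconds between requests.
print(rp.crawl_delay("W_Univ_BJ_spider"))  # 10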
- robots.txt is cached by our crawler for at least 24 hours, and for about 72
hours on average; the exact lifetime depends on the gathering frequency. A toy
cache illustrating this idea follows.
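
To illustrate the idea, here is a toy TTL cache showing how a crawler might
reuse a fetched robots.txt until it expires. RobotsCache and the 24-hour
default TTL are assumptions for this example, not the actual implementation.

import time

class RobotsCache:
    def __init__(self, ttl_seconds=24 * 3600):
        self.ttl = ttl_seconds
        self._store = {}  # host -> (fetched_at, robots_txt_text)

    def get(self, host, fetch):
        # Re-fetch via the caller-supplied fetch(host) callable only
        # after the cached copy is older than the TTL.
        now = time.time()
        entry = self._store.get(host)
        if entry is None or now - entry[0] > self.ttl:
            self._store[host] = (now, fetch(host))
        return self._store[host][1]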