Information about Agents

Why does 192.comAgent crawl my site?

192.comAgent crawls UK websites and indexes contact information so that each website can appear as a result of a search on 192.com.

What does 192.comAgent index?

The spider extracts contact information from the websites it visits and adds it to the 192.com database, helping users find the contact details of a business. A 'web' sign appears next to each extracted result on 192.com. Where available, a full URL linking to the business website appears alongside the other information for the searched business.

How do I stop 192bot from crawling all or part of my site?

robots.txt is a standard document that can 'instruct' 192bot not to download some or all information from your web server. The format of the robots.txt file is specified in the Robot Exclusion Standard. For detailed instructions on how to stop 192bot from crawling all or part of your site, please refer to our removal instructions. Remember, changes to your server's robots.txt file are not reflected in 192bot immediately; they take effect the next time 192bot crawls your site.
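
For example, a robots.txt entry like the one below would stop 192bot from downloading anything under two directories (the directory names /private/ and /internal/ are placeholders for your own):

User-agent: 192.comAgent
Disallow: /private/
Disallow: /internal/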

How can I remove my entire website content from 192.com's index?

If you wish to exclude your entire website from 192.com's index, you can place a file at the root of your server called robots.txt. This is the standard protocol that most web crawlers observe for excluding a web server or directory from an index. More information on robots.txt is available here.

To remove your site from 192.com only, and to prevent 192bot from crawling it in the future, place the following robots.txt file in your server root:

User-agent: 192.comAgent
Disallow: /

To remove your site from search engines and prevent all robots from crawling it in the future, place the following robots.txt file in your server root:

User-agent: *
Disallow: /

Each port must have its own robots.txt file. In particular, if you serve content via both http and https, you'll need a separate robots.txt file for each of these protocols. For example, to allow 192bot to index all http pages but no https pages, you'd use the robots.txt files below.

For your http protocol (http://yourserver.com/robots.txt):

User-agent: 192.comAgent
Disallow:

(An empty Disallow value excludes nothing, so the whole site may be crawled; the Robot Exclusion Standard does not define an Allow directive.)

For the https protocol (https://yourserver.com/robots.txt):

User-agent: 192.comAgent
Disallow: /

How can I remove part of my website's content from 192.com's index?

Option 1: Robots.txt

To remove directories or individual pages of your website, you can place a robots.txt file at the root of your server. For information on how to create a robots.txt file, see The Robot Exclusion Standard. When creating your robots.txt file, please keep the following in mind:

To remove all pages under a particular directory (for example, a directory called hypos), you need to use the following robots.txt entry:

User-agent: 192.comAgent
Disallow: /hypos/

Option 2: Meta tags

Another standard, which can be more convenient for page-by-page use, involves adding a <META> tag to an HTML page to tell robots not to index the page. Instructions can be found here.

To prevent all robots from indexing a page on your site, you need to place the following meta tag into the <HEAD> section of your page:

<META NAME="ROBOTS" CONTENT="NOINDEX, NOFOLLOW">

To allow robots to index the page on your site but instruct them not to follow outgoing links, you need to use the following tag:

<META NAME="ROBOTS" CONTENT="NOFOLLOW">

Why is 192bot trying to download incorrect links from my server or from a server that doesn't exist?

At any given time, many links on the web are broken or outdated. Whenever someone publishes an incorrect link to your site (perhaps due to a typo or spelling error), or fails to update links to reflect changes on your server, 192bot will try to download that incorrect link from your site. This also explains why you may get hits on a machine that is not even a web server.

What are the IP addresses from which 192bot crawls so that I can filter my logs?

All our servers are hosted with a German ISP. The IP addresses are:

87.106.5.67
87.106.11.215
212.227.102.33
212.227.102.48
87.106.130.22
87.106.129.54

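If you prefer to check programmatically rather than by eye, a short script can pick these addresses out of your access log. Below is a minimal sketch in Python; the log path and the assumption that the client IP is the first field (common log format) are ours to adapt to your own server, not something our spider requires:

# Minimal sketch: report access-log lines originating from 192bot's
# published IP addresses. Log path and log format are assumptions.
BOT_IPS = {
    "87.106.5.67", "87.106.11.215",
    "212.227.102.33", "212.227.102.48",
    "87.106.130.22", "87.106.129.54",
}

with open("/var/log/apache2/access.log") as log:  # assumed path
    for line in log:
        ip = line.split(" ", 1)[0]  # common log format: client IP first
        if ip in BOT_IPS:
            print(line, end="")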

Why is 192bot downloading the same page on my site multiple times?

In principle, 192bot should download only one copy of each file from your site during a given crawl. Very occasionally the crawler is stopped and restarted, which may cause it to re-crawl pages it has recently retrieved. The spider obeys a site's robots.txt file, so it only extracts data that the website owner wishes to make available to search engines.

What do I need to do to ensure indexing of contact details on my site?

To make sure that your website's contact data is captured by our spider, please place all relevant contact data on a page called contactus on your website and label it clearly, so that the spider can extract the relevant details. Basically, any human-readable format is appropriate as long as it has proper delimiters.

This is a correct format:

Company Name
Building
DoorNumber Street
Postcode
Village, Town, County
Telephone, fax, e-mail

or

Company Name, Building, DoorNumber Street, Postcode, Town, telephone, fax ...

This format is incorrect:

Company Name
Building DoorNumber Street
Postcode Village Town County
Telephone, fax, e-mail

There are no delimiters between Building, DoorNumber, and Street, nor between Postcode, Village, Town, and County.

Another example of an incorrect format:
Company Name Building DoorNumber Street Postcode Village Town County ...

where there are no delimiters at all between the contact details.
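
For illustration, here is a minimal sketch of how a correctly delimited contact block might be marked up on a contactus page; every detail shown (company name, address, numbers) is a placeholder:

<HTML>
<HEAD><TITLE>Contact Us</TITLE></HEAD>
<BODY>
<!-- Each address element sits on its own line (or is comma-separated),
     so the spider can tell the fields apart. All details are placeholders. -->
<P>
Example Ltd<BR>
The Old Mill<BR>
12 High Street<BR>
AB1 2CD<BR>
Exampleville, Sampletown, Demoshire<BR>
Telephone: 01234 567890, Fax: 01234 567891, e-mail: info@example.co.uk
</P>
</BODY>
</HTML>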

Technical Information about the Spider

192bot has been written in C and C++. It is a modified version of htcheck and htdig, using some modules from each. The technical platform is Linux.

To submit your site for crawling, to remove it from crawling, or simply to give us feedback on the spider, please contact us.
