Using Google for Advanced Searching

What is Google Dorking?

Google has crawled and indexed a huge number of websites. Your average Joe uses Google to look up cat pictures (I'm more of a dog person myself…). While Google will have plenty of cat pictures indexed and ready to serve to Joe, this is a relatively trivial use of the search engine compared to what it can do.
For example, we can add operators, much like those in programming languages, to narrow or broaden our search results, or even perform actions such as arithmetic!

Screenshot by the author from Google

For example, if we want to narrow down our search query, we can use quotation marks. Google will interpret everything between the quotation marks as an exact match and only return results containing the precise phrase provided. This is rather helpful for filtering out the rubbish we don't need, as we have done below:

Screenshot by the author from Google
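As a made-up illustration (not the query from the screenshot above), wrapping a phrase in quotes forces an exact-phrase match:

"walking a dog in the rain"

The same words without quotes may return pages containing any of them, in any order.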

So What Makes “Google Dorking” so Appealing?

First of all — and the critical part — it’s legal! It’s all indexed, publicly available information. However, what you do with this is where the question of legality comes into play…

A few standard terms we can search and combine include:

Screenshot by the author from THM

For example, let's say we wanted to use Google to search for all PDFs on bbc.com:

site:bbc.com filetype:pdf

Screenshot by the author from Google

Great! Now we've refined our search so that Google queries for all publicly accessible PDFs on "bbc.com." You wouldn't have found files like this "Freedom of Information Request Act" file with a wordlist!

Here, we used the extension PDF, but can you think of any other file formats of a sensitive nature that may be publicly accessible? (Often unintentionally!!) Again, what you do with any results you find is where the legality comes into play—this is why “Google Dorking” is so great/dangerous.
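As a purely hypothetical illustration (example.com is just a placeholder domain), the same pattern extends to other document types; whether anything sensitive turns up depends entirely on what a site has exposed:

site:example.com filetype:xlsx
site:example.com filetype:docx
site:example.com filetype:sql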

It may seem unnecessary to explain how search engines work, but there’s a lot more happening behind the scenes than meets the eye. Importantly, we can use this to uncover information that a simple word list wouldn’t provide. Research is fundamental in the field of cybersecurity and is a vital part of a pentester’s work. MuirlandOracle has developed an excellent learning resource that explores attitudes toward research and demonstrates the valuable information it can yield.

“Search Engines” such as Google are huge indexers — specifically, indexers of content spread across the World Wide Web.

These essential tools for navigating the internet use "Crawlers" or "Spiders" to search for this content across the World Wide Web, which I will discuss in the next section.

Let’s Learn About Crawlers

What are Crawlers and How Do They Work?

These crawlers discover content through various means. One is pure discovery, where the crawler visits a URL and returns information regarding the website’s content type to the search engine. Modern crawlers scrape a lot of information—but we will discuss how this is used later. Another method crawlers use to discover content is following any URLs found from previously crawled websites. It is much like a virus; it will want to traverse/spread to everything it can.
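To make that traversal idea concrete, here is a minimal, purely illustrative sketch of a crawler in Python. It is not how Google's crawlers are actually implemented; it simply follows every link it finds, assuming the third-party requests and beautifulsoup4 packages are installed:

import requests
from bs4 import BeautifulSoup
from urllib.parse import urljoin

def crawl(start_url, max_pages=10):
    """Breadth-first crawl: visit a URL, record its title, queue every link found."""
    queue, seen, index = [start_url], set(), {}
    while queue and len(seen) < max_pages:
        url = queue.pop(0)
        if url in seen:
            continue
        seen.add(url)
        try:
            response = requests.get(url, timeout=5)
        except requests.RequestException:
            continue  # skip anything we can't reach
        soup = BeautifulSoup(response.text, "html.parser")
        # Scrape something from the page (here just the title, standing in for keywords)
        index[url] = soup.title.string if soup.title else ""
        # Follow every URL found on the page, exactly like the discovery described above
        for link in soup.find_all("a", href=True):
            queue.append(urljoin(url, link["href"]))
    return index

# Hypothetical usage: print(crawl("https://mywebsite.com"))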

Let’s Visualise Some Things…

The diagram below is a high-level abstraction of how these web crawlers work. Once a web crawler discovers a domain such as mywebsite.com, it will index the entire domain, looking for keywords and other miscellaneous information — but I will discuss this miscellaneous information later.

Screenshot by the author from THM

In the diagram above, "mywebsite.com" has been scraped as having the keywords "Apple," "Banana," and "Pear." The crawler stores these keywords in a dictionary, which is then returned to the search engine, i.e., Google. Because of this persistence, Google now knows that the domain "mywebsite.com" has the keywords "Apple," "Banana," and "Pear." As only one website has been crawled, if a user searched for "Apple," "mywebsite.com" would appear. The same would happen if the user searched for "Banana": as the indexed contents from the crawler report the domain as having "Banana," it will be displayed to the user.

As illustrated below, a user submits a query to the search engine for “Pears.” Because the search engine only has the contents of one website that has been crawled with the keyword “Pears,” it will be the only domain presented to the user.

Screenshot by the author from THM

However, as previously mentioned, crawlers attempt to traverse (termed crawling) every URL and file they can find! Say "mywebsite.com" had the same keywords as before ("Apple," "Banana," and "Pear") but also contained a URL to another website, "anotherwebsite.com." The crawler will then attempt to traverse everything on that URL (anotherwebsite.com) and retrieve the contents of everything within that domain.

This is illustrated in the diagram below. The crawler initially finds "mywebsite.com," where it crawls the contents of the website, finding the same keywords ("Apple," "Banana," and "Pear") as before, but it also spots an external URL. Once the crawler has finished with "mywebsite.com," it proceeds to crawl the contents of "anotherwebsite.com," where the keywords "Tomatoes," "Strawberries," and "Pineapples" are found. The crawler's dictionary now contains the contents of both "mywebsite.com" and "anotherwebsite.com," which are then stored and saved within the search engine.
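As a rough sketch of that "dictionary" idea (purely illustrative; real search indexes are vastly more sophisticated), the crawler's results can be pictured as a simple keyword-to-domain mapping in Python:

# The crawler's "dictionary": each crawled domain and the keywords found on it
crawled = {
    "mywebsite.com": ["Apple", "Banana", "Pear"],
    "anotherwebsite.com": ["Tomatoes", "Strawberries", "Pineapples"],
}

def search(keyword):
    """Return every crawled domain whose stored keywords contain the search term."""
    return [domain for domain, keywords in crawled.items() if keyword in keywords]

print(search("Pear"))      # ['mywebsite.com']
print(search("Tomatoes"))  # ['anotherwebsite.com']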

Screenshot by the author from THM

So, to recap, the search engine now knows two domains that have been crawled:
1. mywebsite.com
2. anotherwebsite.com

However, note that "anotherwebsite.com" was only crawled because the first domain, "mywebsite.com," referenced it. Because of this reference, the search engine knows the following about the two domains:

Screenshot by the author from THM

Now that the search engine has some knowledge of keywords, if a user were to search for "Pears," the domain "mywebsite.com" will be displayed — as it is the only crawled domain containing "Pears":

Screenshot by the author from THM

This is great… But imagine if a website had multiple external URLs (as they often do!). That would require a lot of crawling. There's also the chance that one website holds information similar to another website that has already been crawled, right? So, how does the "Search Engine" decide on the hierarchy of domains displayed to the user?

In the diagram below, if the user were to search for a keyword such as "Tomatoes" (which websites 1–3 all contain), what decides which website gets displayed, and in what order?

Screenshot by the author from THM

Answer the questions below.

Name the critical term of what a “Crawler” is used to do.

index

What technique do "Search Engines" use to retrieve this information about websites?

crawling

What is an example of content that could be gathered from a website?

keywords

Search Engine Optimization

Search Engine Optimisation, or SEO, is a prevalent and lucrative topic in modern-day search engines. So much so that entire businesses capitalize on improving a domain’s SEO “ranking.” From an abstract view, search engines will “prioritize” those domains that are easier to index. Many factors affect how “optimal” a domain is — resulting in something similar to a point-scoring system.

To highlight a few of the factors that influence how these points are scored:

• How responsive your website is to different browser types, e.g., Google Chrome, Firefox, and Internet Explorer — this includes mobile phones!

• How easy it is to crawl your website (or if crawling is even allowed …but we’ll come to this later) using “Sitemaps.”

• What kind of keywords your website has (e.g., in our examples, if the user were to search for a query like "Colours," no domain would be returned, as the search engine has not (yet) crawled a domain that has any keywords to do with "Colours").
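Purely as a toy illustration of that point-scoring idea (the factor names and weights below are invented for this sketch and are not Google's actual algorithm), a ranking score might be accumulated like this in Python:

# Hypothetical SEO "point scoring"; the factors and weights are made up for illustration
def seo_score(site):
    score = 0
    score += 30 if site.get("mobile_friendly") else 0    # responsive across browsers/devices
    score += 30 if site.get("has_sitemap") else 0         # easy to crawl
    score += 20 if site.get("allows_crawling") else 0     # crawling permitted at all
    score += min(20, 2 * len(site.get("keywords", [])))   # relevant keywords found on the site
    return score

print(seo_score({"mobile_friendly": True, "has_sitemap": True,
                 "allows_crawling": True, "keywords": ["Apple", "Banana", "Pear"]}))  # 86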

Various online tools — sometimes provided by the search engine providers themselves — will show you how optimized your domain is.

For instance, let’s use Google’s Site Analyzer to check Amazon.com’s rating.

Screenshot by the author

According to this tool, Amazon has an SEO rating of 70/100 (as of 15/05/2024). That’s not too bad, and the justifications for how this score was calculated are below on the page.

But…Who or What Regulates These “Crawlers”?

Aside from the search engines that provide these "Crawlers," website/web-server owners themselves ultimately stipulate what content "Crawlers" can scrape. Search engines will want to retrieve everything from a website — but there are a few cases where we wouldn't want all of the contents of our website to be indexed! Can you think of any…? How about a secret administrator login page? We don't want everyone to be able to find that directory — least of all through a Google search.

Robots.txt

Similar to “Sitemaps,” which we will later discuss, this file is the first thing Crawlers index when visiting a website.

But what is it?

This file must be served from the root directory of the webserver. Since its extension is .txt, it's safe to assume it is a text file.

The text file defines the permissions the "Crawler" has to the website. For example, what type of "Crawler" is allowed (e.g., you may only want Google's "Crawler" to index your site and not MSN's)? Moreover, Robots.txt can specify what files and directories we do or don't want to be indexed by the "Crawler."
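As an aside, well-behaved crawlers can check these permissions programmatically. Python's standard library ships a parser for exactly this; a minimal sketch (using the article's example domain and a made-up path) might look like:

from urllib import robotparser

rp = robotparser.RobotFileParser()
rp.set_url("http://mywebsite.com/robots.txt")  # robots.txt always lives in the web root
rp.read()
# Would a crawler identifying itself as "Googlebot" be allowed to fetch this (hypothetical) path?
print(rp.can_fetch("Googlebot", "http://mywebsite.com/some-page/"))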

A very basic markup of a Robots.txt file looks like the following:

Screenshot by the author from THM

Here, we have a few keywords…

Screenshot by the author from THM

In this case:

1. Any “Crawler” can index the site

2. The “Crawler” is allowed to index the entire contents of the site

3. The “Sitemap” is located at http://mywebsite.com/sitemap.xml
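In plain text, the file described by those three points would read roughly as follows:

User-agent: *
Allow: /
Sitemap: http://mywebsite.com/sitemap.xml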

What if we wanted to hide directories or files from a "Crawler"? Robots.txt works on a "blacklisting" basis: unless told otherwise, the Crawler will index whatever it can find.

Screenshot by the author from THM

In this case:

1. Any “Crawler” can index the site

2. The "Crawler" can index all content except what is contained within "/super-secret-directory/."

Crawlers also know the difference between sub-directories, directories, and files, as in the case of the second "Disallow:" entry ("/not-a-secret/but-this-is/"):

The “Crawler” will index all the contents within “/not-a-secret/,” but will not index anything contained within the sub-directory “/but-this-is/.”

3. The “Sitemap” is located at http://mywebsite.com/sitemap.xml
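Again in plain text, a robots.txt matching that description would look something like:

User-agent: *
Disallow: /super-secret-directory/
Disallow: /not-a-secret/but-this-is/
Sitemap: http://mywebsite.com/sitemap.xml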

What if we Only Wanted Certain “Crawlers” to Index our Site?

We can stipulate so, such as in the picture below:

Screenshot by the author from THM

In this case:

1. The “Crawler” “Googlebot” is allowed to index the entire site (“Allow: /”)

2. The "Crawler" "msnbot" is not allowed to index the site ("Disallow: /")
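In plain text, that configuration would be along the lines of:

User-agent: Googlebot
Allow: /

User-agent: msnbot
Disallow: /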

How about Preventing Files From Being Indexed?

You could make a manual entry for every file you don't want indexed, but you'd have to provide the directory it is in and the whole filename for each one. Imagine doing that for a tremendous site! What a pain… Here's where we can use a bit of regexing.

Screenshot by the author from THM

In this case:

1. Any “Crawler” can index the site

2. However, the "Crawler" cannot index any file with the .ini extension within any directory/sub-directory of the site (the "$" anchors the match to the end of the filename).

3. The “Sitemap” is located at http://mywebsite.com/sitemap.xml
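A plain-text version of that rule would look something like this (the "*" wildcard matches any path):

User-agent: *
Disallow: /*.ini$
Sitemap: http://mywebsite.com/sitemap.xml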

Why would you want to hide a .ini file, for example? Files like this contain sensitive configuration details. Can you think of any other file formats that might contain sensitive information?

Answer the questions below.

Where would "robots.txt" be located on the domain "ablog.com"?

ablog.com/robots.txt

If a website was to have a sitemap, where would that be located?

sitemap.xml

How would we only allow “Bingbot” to index the website?

User-agent: Bingbot

How would we prevent a “Crawler” from indexing the directory “/dont-index-me/”?

Disallow: /dont-index-me/

What is the extension of a Unix/Linux system configuration file that we might want to hide from “Crawlers”?

.conf

Thanks for reading.

CyberLuk3