The Perils of Google Search for Investigative Due Diligence

November 7, 2019

Although Google has cemented itself as the “go-to” online research tool, its search technology is not equally suited to every research need. Because Google Search focuses on answering searchers’ questions within the first handful of results, successive updates have made it less suitable for deep-dive due diligence work. This has led investigators to develop manual workarounds to uncover all the information they need. Nascent automated due diligence technology incorporating artificial intelligence (AI) offers researchers the chance to automate these time-consuming workarounds, giving them more time for analysis and synthesis.

How Google Search Has Changed

Google’s original breakthrough was the development of the “PageRank” algorithm, which incorporated results' relative popularity on the internet into the widely used method of ranking based on keyword frequency. However, in the two decades since, Google’s search indexing and ranking methodologies (the full details of which are kept highly confidential) have undergone numerous changes. Moz, a major search engine optimization (SEO) company, maintains a list of confirmed and suspected major Google search algorithm changes, which shows that Google’s algorithms can change multiple times per day. SEO experts surmise that Google’s algorithms currently encompass nearly 300 signals that contribute to the ranking of results returned for a query.
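
To make the idea of link-based popularity concrete, here is a minimal sketch of the textbook PageRank iteration on a toy four-page web. It is an illustration only: Google's production ranking is confidential, the page names are made up, and the damping factor of 0.85 is simply the commonly cited default.

```python
def pagerank(links, damping=0.85, iterations=50):
    """Textbook PageRank by repeated redistribution of rank along links.

    `links` maps each page to the list of pages it links to.
    """
    pages = sorted(set(links) | {t for targets in links.values() for t in targets})
    n = len(pages)
    rank = {page: 1.0 / n for page in pages}

    for _ in range(iterations):
        # Every page starts each round with a small baseline share.
        new_rank = {page: (1.0 - damping) / n for page in pages}
        for page in pages:
            targets = links.get(page, [])
            if targets:
                # A page passes its rank on, split evenly across its links.
                share = damping * rank[page] / len(targets)
                for target in targets:
                    new_rank[target] += share
            else:
                # A page with no outgoing links spreads its rank over all pages.
                share = damping * rank[page] / n
                for other in pages:
                    new_rank[other] += share
        rank = new_rank
    return rank


# Hypothetical toy web: three pages link to "c", and "c" links back to "a".
toy_web = {"a": ["c"], "b": ["c"], "c": ["a"], "d": ["c"]}
ranks = pagerank(toy_web)
```

In this toy example the page that attracts the most links ("c") ends up with the highest score, and "a" outranks "b" and "d" because its single inbound link comes from a highly ranked page.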

For example, in 2013 Google released the Hummingbird update, a complete reengineering of Google Search's ranking engine. Hummingbird places greater emphasis on natural language queries and the implied intent behind the query ("semantic search") to determine how to return results. However, this meant that less emphasis was placed on matching keywords, which to this day remain hugely important to researchers. Since then, Google has continued to put more algorithmic emphasis on understanding a searcher's intent in order to filter and rank results, as recently as October 2019, when Google integrated its BERT language model into Search to better understand English-language natural language queries.

The Investigator’s Use Case

While algorithmic changes and ranking factors make Google Search a better product for the normal user, the same can’t necessarily be said for researchers who use Google to “dig deep” into online information. Even though Google and other search engines have been a transformative and ultimately indispensable part of any investigative toolkit, certain aspects of mass-market search technology can work against the investigator.

A perfect example of this is Google’s default setting of 10 results per page. According to an analysis conducted by SEO expert Brian Dean, fewer than 1% of searchers open results ranked lower than the first 10. This has led to a joke common among SEO experts that “the best place to hide a dead body is page two of Google search results.” Indeed, the SEO industry tends to focus on getting search results to appear in the top 10.

Investigative researchers must go far beyond the first page of Google’s results for any given search in a hunt for relevant information. This is particularly true for research in developing countries or countries where the mainstream news outlets may be subject to public and/or private censorship. In those cases, the most informative content may be found far beyond the top 10 results determined by Google.

Keywords, rather than natural language queries, and advanced search syntax, such as keyword proximity, are an investigator’s best tools when using search technology. By emphasizing concepts such as query intent, “mobile-friendliness,” geolocation and other factors important to the normal user, Google and other search engines have made it more difficult to find the information that can make or break an investigative assignment. Unfortunately, this is compounded by another feature of any open-ended search engine: the deliberate truncation of search results.
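
As a rough illustration of what a keyword proximity test does, the sketch below checks whether two terms occur within a given number of words of each other in a piece of text. The function name and the sample sentence are hypothetical, and this is not how any particular search engine implements proximity operators.

```python
import re


def within_proximity(text, term_a, term_b, max_distance):
    """Return True if term_a and term_b occur within max_distance words."""
    # Lowercase and split into simple alphanumeric tokens.
    tokens = re.findall(r"[a-z0-9]+", text.lower())
    positions_a = [i for i, tok in enumerate(tokens) if tok == term_a.lower()]
    positions_b = [i for i, tok in enumerate(tokens) if tok == term_b.lower()]
    return any(abs(i - j) <= max_distance
               for i in positions_a for j in positions_b)


sentence = "The director resigned after the fraud investigation began."
near = within_proximity(sentence, "director", "fraud", 5)
far = within_proximity(sentence, "director", "fraud", 3)
```

Here "director" and "fraud" are four words apart, so the first check succeeds and the stricter second check fails. A plain keyword match would treat both cases identically, which is why proximity is useful for tying a subject to a risk term.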

Google's Automatic Filtering Feature

If you ran a Google search on “Bill Gates” right now, you would be able to page through only 150 to 200 results, depending on your geographic location, search history and other factors. However, Google itself says that it has “About 285,000,000 results” for that query. Why the disparity?

The initial results are those that remain after Google has removed what it deems to be duplicative results. A review of Google’s documentation for its custom search engine API (which lets you build mini-versions of Google targeted against some part of its index) reveals that Google applies “automatic filtering” against its search results, which 1) returns only the “most relevant” document of a set of multiple documents that appear to contain the same information; and 2) removes (or ranks lower) results that all come from the same domain, to prevent “host crowding,” where results from a single site overwhelm the total set of results.

By not showing certain results that Google deems duplicative or indicative of result stuffing by a single site, Google may end up withholding results that are extremely relevant to an investigator’s search. These search results often appear to be similar but are actually quite different. For example, regulatory filings may use large amounts of boilerplate language, or a result may be one of several associated filings, each of which provides a small update to the previous filing but otherwise appears nearly identical.

When Google limits the number of results that come from a given domain, a researcher may miss out on important results that didn’t “make the cut.” For example, a search may return a handful of results from a specific news outlet, but a targeted search of that outlet’s website may uncover dozens more news articles that are relevant to the subject of the investigation.
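
The two filtering behaviors described above can be sketched in a few lines. This is a simplified illustration and not Google's actual implementation, which is not public: the similarity measure (overlap of three-word sequences), the 0.6 threshold, the cap of two results per host, and all URLs and texts are assumptions chosen for the example.

```python
from urllib.parse import urlparse


def shingles(text, size=3):
    """Split text into the set of overlapping word sequences of a given size."""
    words = text.lower().split()
    return {tuple(words[i:i + size])
            for i in range(max(len(words) - size + 1, 1))}


def jaccard(a, b):
    """Share of word sequences two documents have in common (0.0 to 1.0)."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def auto_filter(results, similarity_threshold=0.6, max_per_host=2):
    """Drop near-duplicates and cap results per host.

    `results` is assumed to be sorted with the most relevant result first.
    """
    kept, host_counts = [], {}
    for result in results:
        signature = shingles(result["text"])
        # Duplicate filtering: skip anything too similar to a kept result.
        if any(jaccard(signature, shingles(k["text"])) >= similarity_threshold
               for k in kept):
            continue
        # Host crowding: skip results once a host has reached its cap.
        host = urlparse(result["url"]).netloc
        if host_counts.get(host, 0) >= max_per_host:
            continue
        host_counts[host] = host_counts.get(host, 0) + 1
        kept.append(result)
    return kept


BOILERPLATE = ("the company hereby files this amended report pursuant to "
               "section 13 and discloses that the chief financial officer ")
results = [
    {"url": "https://filings.example.org/report-1",
     "text": BOILERPLATE + "has resigned"},
    {"url": "https://filings.example.org/report-2",
     "text": BOILERPLATE + "is under investigation"},
    {"url": "https://news.example.com/a",
     "text": "regulator opens probe into offshore subsidiaries"},
    {"url": "https://news.example.com/b",
     "text": "former executive sues over unpaid bonus"},
    {"url": "https://news.example.com/c",
     "text": "auditor flags related party loans in annual accounts"},
]
kept_urls = [r["url"] for r in auto_filter(results)]
```

With these made-up inputs, the second filing is discarded as a near-duplicate even though its final words carry the most important fact, and the third news article is dropped purely because its host has already supplied two results.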

Although automatic filtering itself can be turned off on a per-search basis and anti-host-crowding mechanisms can be avoided by conducting searches restricted to a specific domain (e.g. “aml due diligence” site:kroll.com), these workarounds require manual intervention by a researcher, onerous repetition of searches against multiple domains, or both. Furthermore, neither of these workarounds solves what investigative researcher Richard McEachin dubbed “The Search Engine Problem”¹: the purposeful truncation of search results returned to the searcher, which is how we got only 169 of 285 million search results for “Bill Gates” in Google’s index.
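
The repetition involved in these workarounds is easy to see when the searches are written out. The sketch below builds one request URL per domain, combining a site-restricted query with filtering switched off. It assumes the Custom Search JSON API endpoint and its `key`, `cx`, `q` and `filter` parameters as we understand them from Google's documentation; the credentials are placeholders, the helper name is hypothetical, and nothing is sent over the network.

```python
from urllib.parse import urlencode

# Assumed endpoint of the Custom Search JSON API; verify against current docs.
API_ENDPOINT = "https://www.googleapis.com/customsearch/v1"


def build_site_queries(phrase, domains,
                       api_key="YOUR_API_KEY", engine_id="YOUR_ENGINE_ID"):
    """Build one request URL per domain, with automatic filtering disabled."""
    urls = []
    for domain in domains:
        params = {
            "key": api_key,      # placeholder credential
            "cx": engine_id,     # placeholder search engine ID
            "q": f'"{phrase}" site:{domain}',  # exact phrase on one domain
            "filter": "0",       # 0 = turn off duplicate content filtering
        }
        urls.append(f"{API_ENDPOINT}?{urlencode(params)}")
    return urls


urls = build_site_queries("aml due diligence", ["kroll.com", "example.org"])
```

Each additional domain of interest adds another request to run and another result set to review, which is the manual burden described above.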

The Potential of Automated Due Diligence

Even after turning off automatic filtering, we get a total of some 400 results for “Bill Gates.” However, even those 400 do not represent Google’s entire result set for the query “Bill Gates,” nor is it possible to get more results for this search. Although we would hardly expect anyone to go through 285 million results (in this example), the 400 results we could retrieve may not even represent the best set of results for an investigator’s research question. This is why, in search engines like Google that attempt to index the whole internet, investigative researchers turn to targeted queries that seek to answer the question at hand through an iterative search approach, rather than trying to “boil the ocean” by making overly general searches intended to find all relevant pieces of information in one fell swoop. However, although this is best practice, it can still be very time-consuming.

This also explains why the application of artificial intelligence and natural language processing for parsing through online search results is transformative to the investigative due diligence process. An automated due diligence engine can concurrently run dozens, if not hundreds, of search engine queries, parse through the results' full content and cut down the set of thousands of results to the important few. When powered by artificial intelligence, the engine can even learn from these results to generate new queries, beginning the cycle anew. And this entire automated due diligence process can take only a matter of seconds or minutes, not hours or days. The end result uncovers the most relevant information for human analysts to review, in a fraction of the time it would take a researcher to find it manually.
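
The overall shape of such an engine can be sketched as follows: run many queries concurrently, merge the results, and keep only the highest-scoring documents. Everything here is a stand-in for illustration. The in-memory index replaces a real search backend, the subject name and URLs are invented, and simple keyword counting takes the place of the AI and natural language processing a real engine would use. The step of generating new queries from the results is not modeled.

```python
from concurrent.futures import ThreadPoolExecutor

# Hypothetical in-memory stand-in for a search backend.
SHARED_TEXT = ("Acme Holdings faces fraud inquiry; "
               "fraud unit confirms sanctions review")
FAKE_INDEX = {
    '"Acme Holdings" fraud': [
        {"url": "https://news.example.com/1", "text": SHARED_TEXT},
        {"url": "https://blog.example.com/2",
         "text": "Acme Holdings opens new office"},
    ],
    '"Acme Holdings" sanctions': [
        {"url": "https://news.example.com/1", "text": SHARED_TEXT},
        {"url": "https://gov.example.org/3",
         "text": "Sanctions list update names Acme Holdings"},
    ],
}


def run_query(query):
    """Look up one query; a real engine would call a search API here."""
    return FAKE_INDEX.get(query, [])


def score(document, risk_terms):
    """Count risk-term occurrences (a crude substitute for an NLP model)."""
    text = document["text"].lower()
    return sum(text.count(term) for term in risk_terms)


def screen(subject, risk_terms, top_n=3):
    """Run one query per risk term in parallel and return the top documents."""
    queries = [f'"{subject}" {term}' for term in risk_terms]
    with ThreadPoolExecutor(max_workers=8) as pool:
        batches = list(pool.map(run_query, queries))

    # Merge the batches, keeping one copy of each URL.
    unique = {}
    for batch in batches:
        for document in batch:
            unique.setdefault(document["url"], document)

    ranked = sorted(unique.values(),
                    key=lambda d: score(d, risk_terms), reverse=True)
    return [d for d in ranked if score(d, risk_terms) > 0][:top_n]


hits = screen("Acme Holdings", ["fraud", "sanctions", "bribery"])
hit_urls = [d["url"] for d in hits]
```

In this example three queries return four results, which collapse to three unique documents, of which only the two that mention a risk term are passed on for human review.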

Emerging AI capabilities free up compliance and research experts for more thorough due diligence and high-level risk management tasks. However, organizations must still have a dedicated due diligence team and a risk assessment strategy that drives effective risk management processes. To take advantage of the benefits of AI, it’s important to understand what technology best fits into your program and to find the right provider.

This article is part of an ongoing series on AI and the future of regulatory compliance and due diligence.

Source:
¹ McEachin, Richard B. Sources & Methods for Investigative Internet Research. Confidential Resource Press, 2013.
