Automated Data Acquisition for Cybersecurity

By Karolis Toleikis, Chief Executive Officer at IPRoyal.

Monday, 22nd July 2024. Posted in Security + Compliance by Phil Alsop.

While the Big Data revolution has now made way for the AI revolution, we still learned plenty of lessons from the former. In some sense, the former revolution has ended not because it has failed, but because data has become so commonplace that there’s nothing left to write home about.

A major victory of the Big Data era is that automated data collection solutions became much more accessible. Instead of being reserved for large corporations, data collection at scale is now within reach of SMBs as well.

In turn, that made data more available to a wider range of businesses. Cybersecurity is one such sphere where automated data acquisition has netted immense benefits.

Finding needles in haystacks

Big Data is definitely far from the most important aspect of cybersecurity, compared to keeping a watchful eye on emerging threats, hacker discussions, new exploits, data leaks, and much more. Yet, the internet is vast, decentralized and scattered all over the place – you won’t find all of the above on one website.

Couple that with the existence of the deep web, a part of the internet that search engines do not index, and there are potentially millions of pages to watch, with thousands of new candidates cropping up each day. Manual labor would be so cost-prohibitive that no cybersecurity company could reasonably keep track of everything.

Automated data extraction (or web scraping) largely eliminates the problem. If we simplify the process by a large margin, all that happens is that a bot (or automated script) visits a web page and downloads the HTML file.

The HTML file contains all of the information on that page. You can then use data analysis tools to extract the valuable information (such as forum posts) from the file. Repeating the process thousands of times might take just a minute or two.
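As a rough illustration of that extraction step, the sketch below pulls post text out of a downloaded page using only the Python standard library. The markup (posts wrapped in `<div class="post">`), the sample page, and the names `PostExtractor`, `extract_posts` and `download` are assumptions made for the example; real forums use their own structure, so the selector logic would need adapting.

```python
from html.parser import HTMLParser
import urllib.request


def download(url):
    """Fetch a page and return its HTML as text (not called in this sketch)."""
    with urllib.request.urlopen(url, timeout=10) as response:
        return response.read().decode("utf-8", errors="replace")


class PostExtractor(HTMLParser):
    """Collects the text of every <div class="post"> element (hypothetical markup)."""

    def __init__(self):
        super().__init__()
        self.posts = []     # finished post texts
        self._depth = 0     # greater than 0 while inside a post <div>
        self._chunks = []   # text fragments of the post being read

    def handle_starttag(self, tag, attrs):
        if tag != "div":
            return
        if self._depth:
            self._depth += 1  # a nested <div> inside a post
        elif "post" in (dict(attrs).get("class") or "").split():
            self._depth = 1
            self._chunks = []

    def handle_endtag(self, tag):
        if tag == "div" and self._depth:
            self._depth -= 1
            if self._depth == 0:
                # Join fragments and collapse runs of whitespace.
                self.posts.append(" ".join("".join(self._chunks).split()))

    def handle_data(self, data):
        if self._depth:
            self._chunks.append(data)


def extract_posts(html):
    parser = PostExtractor()
    parser.feed(html)
    parser.close()
    return parser.posts


SAMPLE_HTML = """
<html><body>
  <div class="post">New exploit for <b>ExampleCMS</b> posted</div>
  <div class="sidebar">Ads</div>
  <div class="post">Selling database dump</div>
</body></html>
"""

if __name__ == "__main__":
    for post in extract_posts(SAMPLE_HTML):
        print(post)
```

In practice many teams would reach for a dedicated parsing library instead; the standard-library parser is used here only to keep the sketch self-contained.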

There’s a caveat, though, as sending requests to the same websites at a rate of a thousand per minute makes it obvious that something is amiss (even if the intentions are good). Websites will frequently ban the IP address of such an actor, making it impossible to get data.

As such, even cybersecurity companies turn to residential proxy providers. These businesses provide access to large pools of IP addresses, which let cybersecurity companies avoid bans and switch between perceived geographical locations with ease.
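A minimal sketch of how requests might be spread across such a pool is shown below. The proxy addresses are placeholders, not real endpoints, and the `fetch_all` helper, its `delay_seconds` pause and its injectable `fetch` callback are assumptions made for illustration; actual providers document their own hosts, ports and authentication.

```python
import itertools
import time
import urllib.request

# Hypothetical proxy endpoints; a real provider supplies its own hosts and credentials.
PROXIES = [
    "http://proxy-a.example:8080",
    "http://proxy-b.example:8080",
    "http://proxy-c.example:8080",
]


def build_opener_for(proxy_url):
    """Return a urllib opener that routes HTTP and HTTPS traffic through one proxy."""
    handler = urllib.request.ProxyHandler({"http": proxy_url, "https": proxy_url})
    return urllib.request.build_opener(handler)


def fetch_all(urls, proxies, delay_seconds=1.0, fetch=None):
    """Fetch each URL through the next proxy in round-robin order.

    `fetch` is an optional callable (url, proxy) -> bytes, useful for
    dry runs without touching the network.
    """
    rotation = itertools.cycle(proxies)
    results = []
    for url in urls:
        proxy = next(rotation)
        if fetch is None:
            with build_opener_for(proxy).open(url, timeout=10) as response:
                body = response.read()
        else:
            body = fetch(url, proxy)
        results.append((url, proxy, body))
        time.sleep(delay_seconds)  # pause so a single site is not flooded
    return results
```

The pause between requests matters as much as the rotation: spreading traffic over many addresses does not remove the obligation to keep the load on the target website reasonable.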

After successfully running through thousands of pages, you get a database that can be searched through for interesting discussions, links, files, and anything else. It can be appended, updated or expanded upon in minutes. Some companies may even run real-time tools that constantly monitor specific places on the web.
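The kind of searchable store described above could look roughly like the following sketch, which uses SQLite from the Python standard library. The table layout, the URLs and the function names `open_store`, `save_page` and `search` are all assumptions for the example rather than a description of any particular product.

```python
import sqlite3


def open_store(path=":memory:"):
    """Open (or create) a small page store; in-memory by default."""
    conn = sqlite3.connect(path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS pages ("
        " url TEXT PRIMARY KEY,"
        " fetched_at TEXT NOT NULL,"
        " body TEXT NOT NULL)"
    )
    return conn


def save_page(conn, url, fetched_at, body):
    """Insert a page, or overwrite the stored copy if the URL is already known."""
    conn.execute(
        "INSERT OR REPLACE INTO pages (url, fetched_at, body) VALUES (?, ?, ?)",
        (url, fetched_at, body),
    )
    conn.commit()


def search(conn, keyword):
    """Return URLs whose body contains the keyword (ASCII case-insensitive)."""
    rows = conn.execute(
        "SELECT url FROM pages WHERE instr(lower(body), lower(?)) > 0 ORDER BY url",
        (keyword,),
    )
    return [row[0] for row in rows]


if __name__ == "__main__":
    store = open_store()
    save_page(store, "https://forum.example/t/1", "2024-07-22T10:00:00Z",
              "Selling fresh database DUMP")
    save_page(store, "https://forum.example/t/2", "2024-07-22T10:01:00Z",
              "General chat")
    print(search(store, "dump"))
```

Re-running `save_page` for a URL that has changed simply replaces the stored copy, which is one simple way to keep such a database updated between crawls.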

With automated data extraction, cybersecurity companies can run monitoring on anything they deem interesting. And many of them already do.

A common application is detecting data dumps or leaks. If you’ve ever received a warning about a newly published data leak, it’s likely that the leak was found through web scraping and verified by a human.
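One simple way the automated half of that process might work is to scan scraped text for credential-like lines that mention domains a company cares about. The sketch below assumes leaked data in an `email:password` layout and a hypothetical list of watched domains; real dumps come in many formats, and any hit would still need the human verification mentioned above.

```python
import re

# Matches "address@domain:secret" pairs; the layout is an assumption for this sketch.
CREDENTIAL_LINE = re.compile(
    r"(?P<email>[A-Za-z0-9._%+-]+@(?P<domain>[A-Za-z0-9.-]+\.[A-Za-z]{2,})):\S+"
)


def find_exposed_accounts(text, watched_domains):
    """Return the sorted, de-duplicated addresses from watched domains.

    Only the addresses are returned; the secrets themselves are never kept.
    """
    watched = {domain.lower() for domain in watched_domains}
    hits = set()
    for match in CREDENTIAL_LINE.finditer(text):
        if match.group("domain").lower() in watched:
            hits.add(match.group("email").lower())
    return sorted(hits)


SAMPLE_DUMP = """\
alice@example.com:hunter2
bob@other.org:letmein
Alice@Example.com:hunter3
no credentials on this line
"""

if __name__ == "__main__":
    print(find_exposed_accounts(SAMPLE_DUMP, ["example.com"]))
```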

That’s not the only way to apply automated data extraction, however.

Categorizing phish

Based on various estimates, over 250 000 websites are being created on a daily basis. While most of them are legitimate ones, run by individuals or businesses, some fraction of those websites will be created with the intention of stealing data.

One method that appears rather frequently is to create a website that pretends to be a bank or payment provider. All of the imagery, information, and branding is made to be as close as possible to the real website. But the payment system, instead of running legitimate transactions, sends sensitive information to a malicious actor.

Some of these websites may be so advanced that they redirect the payment to a legitimate provider through various methods to further confuse the user.

Such attacks affect thousands of users on a daily basis, successfully or not. Cybersecurity companies are working around the clock to do their best to protect unsuspecting users from phishing attacks.

Dedicated scraping solutions can be built to manage these processes more easily. There’s an interesting game at play – attackers have to use well-known brands and websites to trick the most users. But that greatly limits their options, which makes it easier for cybersecurity companies to detect malicious websites.

They can run scraping tools that compare images and text against legitimate websites that belong to well-known brands. If enough of a match is detected, they may verify manually that the website is impersonating something else.
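The text half of such a comparison can be sketched with word shingles and Jaccard similarity, as below. This is one common technique rather than the method any specific vendor uses; the brand names, reference texts and the 0.5 threshold are invented for the example, and image comparison is left out entirely.

```python
import re


def shingles(text, size=3):
    """Return the set of overlapping word triples (by default) in the text."""
    words = re.findall(r"[a-z0-9]+", text.lower())
    if len(words) < size:
        return {tuple(words)} if words else set()
    return {tuple(words[i:i + size]) for i in range(len(words) - size + 1)}


def jaccard(a, b):
    """Share of shingles the two sets have in common, between 0.0 and 1.0."""
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)


def rank_brand_matches(page_text, brand_texts, threshold=0.5):
    """Return (brand, score) pairs at or above the threshold, best match first."""
    page = shingles(page_text)
    scored = [(brand, jaccard(page, shingles(text))) for brand, text in brand_texts.items()]
    return sorted((pair for pair in scored if pair[1] >= threshold), key=lambda pair: -pair[1])


# Hypothetical reference copy taken from legitimate brand pages.
BRAND_TEXTS = {
    "ExampleBank": "Welcome to Example Bank online banking. Sign in to your account",
    "ExamplePay": "Send money fast with Example Pay",
}

if __name__ == "__main__":
    suspicious = "Welcome to Example Bank online banking. Sign in to your account now"
    print(rank_brand_matches(suspicious, BRAND_TEXTS))
```

Pages that score above the threshold would be queued for the manual check rather than flagged automatically, since legitimate partner or news pages can also quote brand text.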

Such a website could then be added to a database of known phishing domains and sent to responsible parties to take it down as soon as possible.

Brand protection through data

While brand protection is not what most envision as classic cybersecurity, the former is often put under the umbrella of the latter. Most of what happens with brand infringement, counterfeiting, and activities of similar ilk faces the same problems as threat and vulnerability scanning – the internet is simply too vast for manual work.

Automated data extraction is therefore commonly used to scan large swaths of websites for potential infringements. It most often targets marketplaces, classified ads platforms, and other peer-to-peer sales websites.

A wide array of strategies is used to discover potential counterfeiters or infringers. Digital infringement, however, is somewhat different from physical counterfeits. The former is a little easier to handle, because identical (or highly similar) images are relatively easy to detect across the internet.
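Detecting highly similar images is often approached with perceptual hashing. The sketch below implements a basic average hash over a small grayscale pixel grid; it assumes the image has already been decoded and downscaled by an imaging library, and the 4x4 sample grids are made up for illustration (8x8 is a more typical size).

```python
def average_hash(pixels):
    """Build a bit fingerprint: 1 where a pixel is brighter than the grid's mean."""
    flat = [value for row in pixels for value in row]
    mean = sum(flat) / len(flat)
    bits = 0
    for value in flat:
        bits = (bits << 1) | (1 if value > mean else 0)
    return bits


def hamming_distance(a, b):
    """Number of differing bits between two fingerprints."""
    return bin(a ^ b).count("1")


# Made-up grayscale grids (0 = black, 255 = white).
ORIGINAL = [
    [10, 20, 200, 210],
    [15, 25, 205, 215],
    [10, 20, 200, 210],
    [15, 25, 205, 215],
]
BRIGHTENED = [[value + 10 for value in row] for row in ORIGINAL]
INVERTED = [[255 - value for value in row] for row in ORIGINAL]

if __name__ == "__main__":
    base = average_hash(ORIGINAL)
    print(hamming_distance(base, average_hash(BRIGHTENED)))
    print(hamming_distance(base, average_hash(INVERTED)))
```

A small distance suggests the same picture with minor edits such as brightness changes, while a large distance suggests a different image. Where to draw the line is a tuning decision that depends on the catalogue being protected.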

Physical items are harder to detect, as specialists have to rely on photos or descriptions. There are fewer clear-cut cases when all you have is a written description and a few photos.

While counterfeiting may seem like a luxury-product issue where someone makes fake Gucci handbags, that’s mostly a myth. Far more harmful and damaging products are being counterfeited, with pharmaceuticals being a prime example. Up to 41% of drugs in some countries may be, in one way or another, counterfeit.

So, while fake luxury products may harm someone’s wallet, counterfeit drugs may directly harm someone’s health. As such, web scraping is instrumental not only in protecting people from scams and phishing attempts, but also, potentially, in protecting their health.

Cybersecurity has greatly benefited from the advancements in data collection practices. Automation such as web scraping enables cybersecurity companies to create advanced monitoring systems for a wide variety of purposes, ranging from simple hacker forum monitoring to detecting counterfeit items.

While everyone is now turning to AI, we shouldn’t forget that there’s still a lot left to explore at the intersection of web scraping and cybersecurity. Both of these fields can benefit each other greatly.
