Mass Scraper Devlog #1

Devlog #1 - Introduction to the project
I’m currently fascinated by OSINT and have been enjoying web scraping lately. I’ve been looking for a niche project that combines the two, and that search led me to this one.
The project discovery
The mass scraper started as a simple recursive OSINT tool that implemented a weighted directed acyclic graph (Weighted DAG) to map user information scraped from various sources. The weight on the DAG was used to store the confidence level of the connection between data. This meant that if conflicting data was found, I was able to identify the most likely data source.
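The post doesn’t show the graph code, but the idea above can be sketched roughly like this. Everything here (class name, node values, confidence numbers) is a hypothetical illustration, not the project’s actual implementation:

```python
# Minimal sketch of a confidence-weighted directed graph for OSINT data.
# Names and confidence values are made-up examples.
from collections import defaultdict

class ConfidenceGraph:
    def __init__(self):
        # adjacency map: source node -> {target node: confidence weight}
        self.edges = defaultdict(dict)

    def add(self, source, target, confidence):
        """Record a directed edge, keeping the highest confidence seen so far."""
        current = self.edges[source].get(target, 0.0)
        self.edges[source][target] = max(current, confidence)

    def best(self, source):
        """Return the highest-confidence target for a source, or None."""
        if not self.edges[source]:
            return None
        return max(self.edges[source], key=self.edges[source].get)

g = ConfidenceGraph()
g.add("user123", "Jane Doe", 0.9)  # full name reported by one source
g.add("user123", "Jane Roe", 0.4)  # conflicting name from a weaker source
print(g.best("user123"))           # the higher-confidence value wins
```

When two sources disagree, the edge weights resolve the conflict: the lookup simply returns the target with the largest confidence.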
After starting this, I noticed a huge flaw: I was only able to look up data that the site allowed. For example, if a site accepts a username as input and allows you to see a full name, you could never do a reverse lookup. This led me to discover a completely new path for the project — why not scrape all the users on a website and store them in a database? This way, I can perform any type of lookup I want.
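To make the reverse-lookup point concrete: once the data lives in your own database, queries work in either direction. A tiny SQLite sketch (the schema and values are hypothetical, not the project’s real tables):

```python
# Sketch: with all users stored locally, lookups work in any direction.
# Table and column names here are assumptions for illustration only.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (username TEXT, full_name TEXT)")
conn.execute("INSERT INTO users VALUES ('jdoe42', 'Jane Doe')")

# Forward lookup (what the site itself allows): username -> full name
forward = conn.execute(
    "SELECT full_name FROM users WHERE username = ?", ("jdoe42",)
).fetchone()

# Reverse lookup (impossible through the site): full name -> username
reverse = conn.execute(
    "SELECT username FROM users WHERE full_name = ?", ("Jane Doe",)
).fetchone()
print(forward[0], reverse[0])
```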
Development of the project
I wanted to start with Discord IDs, the unique identifiers Discord assigns to every account on its chat platform. I chose them because they never change, and there have been few good OSINT tools for Discord since spy.pet shut down.
I discovered multiple sites that let me check a username and, if a Discord account was linked, would even show the Discord ID. This is perfect for web scraping: all I need to do now is scrape all of their users.
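The extraction step might look something like the sketch below: given a fetched profile page, pull out the linked Discord ID. The HTML attribute pattern is entirely hypothetical (the real sites will differ), though the 17–19 digit range is the usual shape of a Discord snowflake ID:

```python
# Hypothetical sketch of the extraction step. The data-discord-id markup
# is an assumed pattern, not taken from any real site.
import re

DISCORD_ID_RE = re.compile(r'data-discord-id="(\d{17,19})"')

def extract_discord_id(html: str):
    """Return the first Discord ID found in the page, or None."""
    match = DISCORD_ID_RE.search(html)
    return match.group(1) if match else None

page = '<div class="profile" data-discord-id="123456789012345678">jdoe42</div>'
print(extract_discord_id(page))
```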
After coding and running these scrapers for a week or two, I was able to scrape around 35,000 Discord IDs. I also scraped 163 users’ crypto wallets, 26,585 users’ social media accounts, and 8,894 Valorant usernames.
The next stage
After scraping a few Discord IDs linked to other information, I chose to index a couple of leaked data breaches (a common OSINT source) related to Discord/Roblox. For each breach record, I query my database by Discord ID; if a match is found, the two records are merged and updated.
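A minimal sketch of that merge step, assuming records are keyed by Discord ID. The field names (`discord_id`, `username`, `email`) are hypothetical examples, and the conflict policy shown here (keep existing scraped fields, only fill gaps from the breach) is one reasonable choice, not necessarily the project’s:

```python
# Sketch: merge a breach row into a database keyed by Discord ID.
# Field names and the keep-existing conflict policy are assumptions.
def merge_record(db: dict, breach_row: dict) -> bool:
    """Merge one breach row into db. Returns True if the ID already existed."""
    discord_id = breach_row["discord_id"]
    existing = db.get(discord_id)
    if existing is None:
        db[discord_id] = dict(breach_row)
        return False
    # Only fill fields we don't already have; keep scraped data as-is.
    for key, value in breach_row.items():
        existing.setdefault(key, value)
    return True

db = {"123": {"discord_id": "123", "username": "jdoe42"}}
was_dup = merge_record(db, {"discord_id": "123", "email": "j@example.com"})
print(was_dup, db["123"])
```

Counting how often `merge_record` returns `True` gives exactly the duplicate figure described below.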
After indexing all of these breaches, I was able to build a collection of 185,741 Discord IDs, including 381 duplicates: users found both by the scrapers and in a breach. This lets me link emails to social media accounts and perform reverse lookups.
Still not finished scraping
The scrapers are running perfectly, but I’m still not satisfied with the results. I now turn to UK data (where I’m from). I already know a couple of sources I use for OSINT investigations related to UK entities, so I decided to scrape Companies House.
Companies House’s ToS state that you can scrape or do whatever you want with the data — perfect. I found a list of 5.6 million companies, downloaded the CSV, and parsed it so I just have to store the company ID.
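The parsing step can be sketched with Python’s `csv` module. The `CompanyNumber` column name matches Companies House’s bulk basic company data export as I understand it, but treat it as an assumption and check the actual header row; the sample rows below are invented:

```python
# Sketch: extract company numbers from a Companies House-style CSV.
# Column names are assumed from the bulk export; sample data is invented.
import csv
import io

sample = io.StringIO(
    "CompanyName,CompanyNumber\n"
    "EXAMPLE LTD,01234567\n"
    "ANOTHER EXAMPLE PLC,07654321\n"
)

reader = csv.DictReader(sample)
# Headers in bulk exports can carry stray whitespace, so normalise them.
reader.fieldnames = [name.strip() for name in reader.fieldnames]
company_ids = [row["CompanyNumber"] for row in reader]
print(company_ids)
```

For the real 5.6-million-row file you would stream it from disk the same way rather than loading it into memory.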
After parsing the CSV provided by Companies House, I now have a list of 5.6 million company IDs ready to scrape. In the first two days of scraping, I was able to collect 162,699 unique people’s DoBs, past or current addresses, and full names.
What does the future hold for this project?
I’m looking to scrape some more websites, focusing even more on UK data. The goal is to store a significant portion of UK individuals in the database.
The database currently covers roughly 0.23% of the UK population and 0.075% of the Discord user base. Still a long way to go: more scrapers to build and more data to collect.