Security researchers found that datasets used by companies to develop large language models included API keys, passwords, and other sensitive credentials.
Large language models are dominating the online landscape, with companies promoting AI solutions that claim to solve all problems.
By infosecbulletin
/ Friday , June 12 2026
A security expert called brutecat shared how an AI-based testing system found over $500,000 in weak spots in Google’s systems...
Read More
By infosecbulletin
/ Friday , June 12 2026
Google has released a big security update for Chrome on desktops. Version 149.0.7827.114/.115 is now out for Windows and Mac....
Read More
By infosecbulletin
/ Thursday , June 11 2026
A security notice has revealed serious flaws in some Dahua products. Network admins need to fix these issues fast. The...
Read More
By infosecbulletin
/ Thursday , June 11 2026
South Korea's privacy regulator said on Thursday (June 11) that the country will fine e-commerce giant Coupang 625 billion won...
Read More
By infosecbulletin
/ Thursday , June 11 2026
Oracle PeopleSoft servers are under attack in ongoing data theft by the ShinyHunters gang, which claim to have stolen data...
Read More
By infosecbulletin
/ Wednesday , June 10 2026
Cybersecurity experts found several serious flaws this week in Windows, Chromium, OpenSSL, Microsoft Exchange, and ServiceNow. Some of these flaws...
Read More
By infosecbulletin
/ Wednesday , June 10 2026
GitHub disabled 73 repositories in four Microsoft groups: Azure, Azure-Samples, Microsoft, and MicrosoftDocs. Each repo now shows GitHub’s “This repository...
Read More
By infosecbulletin
/ Wednesday , June 10 2026
A security expert shared a new Microsoft Defender vulnerability called "RoguePlanet" only hours after Microsoft fixed two earlier problems in...
Read More
By infosecbulletin
/ Wednesday , June 10 2026
Microsoft's June 2026 Patch Tuesday updates fix about 200 security flaws found in the company's products. None of the flaws fixed...
Read More
By infosecbulletin
/ Tuesday , June 9 2026
The first business underwater data center run by offshore wind has started working near Shanghai. Submerged 10 metres under the...
Read More
For an AI to be effective, it needs extensive training data, much of which is gathered from the Internet by specialized companies and organizations.
Common Crawl provides datasets for companies to train their AI, gathering information from the internet, which may include sensitive data.
Researchers from Truffle Security discovered that credentials, API keys, and passwords are being exposed. The main issue is that some web developers hardcode sensitive information into websites, which then ends up in LLM training data.
Researchers discovered 11,908 live secrets, such as API keys and passwords, across 2.76 million websites.
“Leaked keys in Common Crawl’s dataset should not reflect poorly on their organization; it’s not their fault developers hardcode keys in front-end HTML and JavaScript on web pages they don’t control. And Common Crawl should not be tasked with redacting secrets; their goal is to provide a free, public dataset based on the public Internet for organizations like Truffle Security to conduct this type of research,” explained the researchers.
Companies that create LLMs have warned against hardcoding sensitive information on websites. They advise avoiding this practice, as users may unintentionally share the code in their work, worsening the issue.