Security researchers found that datasets used by companies to develop large language models included API keys, passwords, and other sensitive credentials.
Large language models are dominating the online landscape, with companies promoting AI solutions that claim to solve all problems.
For an AI to be effective, it needs extensive training data, much of which is gathered from the Internet by specialized companies and organizations.
Common Crawl gathers information from across the internet and provides datasets that companies use to train their AI models; that information can include sensitive data.
Researchers from Truffle Security discovered that credentials, API keys, and passwords are exposed in this data. The root cause is that some web developers hardcode sensitive information into their websites' front-end code, which is then scraped into LLM training data. In total, the researchers found 11,908 live secrets, such as API keys and passwords, across 2.76 million websites.
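The kind of pattern-based scan this research relies on can be sketched in a few lines. The regexes below are illustrative only, not Truffle Security's actual rules; real scanners such as TruffleHog use hundreds of detectors and verify candidate secrets against live services.

```python
import re

# Illustrative patterns only -- real scanners use many more rules
# plus live verification of each candidate secret.
PATTERNS = {
    "aws_access_key": re.compile(r"\bAKIA[0-9A-Z]{16}\b"),
    "generic_api_key": re.compile(
        r"api[_-]?key\s*[:=]\s*['\"]([A-Za-z0-9_\-]{16,})['\"]",
        re.IGNORECASE,
    ),
}

def scan_page(html: str) -> list[tuple[str, str]]:
    """Return (rule_name, matched_text) pairs found in a page's source."""
    hits = []
    for name, pattern in PATTERNS.items():
        for match in pattern.finditer(html):
            hits.append((name, match.group(0)))
    return hits

# A hardcoded key in front-end JavaScript, as described in the article:
page = '<script>const config = { apiKey: "abcd1234efgh5678ijkl" };</script>'
print(scan_page(page))
```

Run over a web-scale corpus like Common Crawl, even simple patterns like these surface large numbers of candidate secrets; the hard part is verifying which ones are still live.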
“Leaked keys in Common Crawl’s dataset should not reflect poorly on their organization; it’s not their fault developers hardcode keys in front-end HTML and JavaScript on web pages they don’t control. And Common Crawl should not be tasked with redacting secrets; their goal is to provide a free, public dataset based on the public Internet for organizations like Truffle Security to conduct this type of research,” explained the researchers.
Companies that create LLMs have warned against hardcoding sensitive information on websites, advising developers to avoid the practice: users may unintentionally reproduce such code in their own work, compounding the problem.
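One common way to avoid the practice is to keep secrets out of front-end code entirely and load them from the server's environment at runtime. A generic sketch (the variable name SERVICE_API_KEY is hypothetical, not any vendor's convention):

```python
import os

def get_api_key() -> str:
    """Read the key from the server's environment instead of embedding it
    in HTML/JavaScript shipped to browsers (and, eventually, to crawlers)."""
    key = os.environ.get("SERVICE_API_KEY")  # hypothetical variable name
    if not key:
        raise RuntimeError("SERVICE_API_KEY is not set")
    return key

# For demonstration only -- in practice the deployment environment or a
# secrets manager sets this value, and it never appears in page source.
os.environ["SERVICE_API_KEY"] = "example-value"
print(get_api_key())
```

Because the key lives only on the server, nothing sensitive is present in the HTML or JavaScript that crawlers like Common Crawl can collect.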