Security researchers found that datasets used by companies to develop large language models included API keys, passwords, and other sensitive credentials.
Large language models are dominating the online landscape, with companies promoting AI solutions that claim to solve all problems.
By infosecbulletin
For an AI to be effective, it needs extensive training data, much of which is gathered from the Internet by specialized companies and organizations.
Common Crawl provides datasets that companies use to train their AI. Because the data is gathered from the public internet, it may include sensitive information.
Researchers from Truffle Security discovered that credentials, API keys, and passwords are being exposed. The main issue is that some web developers hardcode sensitive information into websites, which then ends up in LLM training data.
In total, they found 11,908 live secrets, such as API keys and passwords, across 2.76 million websites.
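To illustrate how hardcoded credentials in front-end code can be spotted, here is a minimal Python sketch that scans page source for strings that look like secrets. It is not the method Truffle Security used: the pattern set, the function name find_candidate_secrets, and the sample page are assumptions made for illustration. It also only flags candidates and does not check whether a key is actually live.

```python
import re

# Illustrative patterns only (an assumption, not Truffle Security's detector set).
# Real scanners ship many more detectors and verify each match against the service.
PATTERNS = {
    # Common shape of an AWS access key ID: "AKIA" followed by 16 uppercase letters or digits.
    "aws_access_key_id": re.compile(r"\bAKIA[0-9A-Z]{16}\b"),
    # A variable named like a secret that is assigned a quoted string of 8+ characters.
    "generic_assignment": re.compile(
        r"""(?i)\b(api[_-]?key|secret|password|token)\b\s*[:=]\s*["']([^"']{8,})["']"""
    ),
}


def find_candidate_secrets(text):
    """Return (pattern_name, matched_value) pairs for strings that look like secrets."""
    findings = []
    for name, pattern in PATTERNS.items():
        for match in pattern.finditer(text):
            # Use the last capture group when the pattern has one, else the whole match.
            value = match.group(match.lastindex) if match.lastindex else match.group(0)
            findings.append((name, value))
    return findings


# Hypothetical page source with fake, non-working values.
SAMPLE_PAGE = """
<script>
  const apiKey = "EXAMPLE_not_a_real_key_123";
  const region = "us-east-1";
  var aws = "AKIAABCDEFGHIJKLMNOP";
</script>
"""

if __name__ == "__main__":
    for name, value in find_candidate_secrets(SAMPLE_PAGE):
        print(f"{name}: {value}")
```

A pattern match alone says little about risk. What makes the reported figure notable is that the secrets were live, which a sketch like this cannot determine.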
“Leaked keys in Common Crawl’s dataset should not reflect poorly on their organization; it’s not their fault developers hardcode keys in front-end HTML and JavaScript on web pages they don’t control. And Common Crawl should not be tasked with redacting secrets; their goal is to provide a free, public dataset based on the public Internet for organizations like Truffle Security to conduct this type of research,” explained the researchers.
Companies that create LLMs have warned developers against hardcoding sensitive information in websites. Code containing credentials may be shared or reused unintentionally, which worsens the issue.
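One common alternative to hardcoding, sketched here in Python, is to read the secret from the server's environment at runtime so it never appears in page source. The variable name EXAMPLE_API_KEY and the helper load_api_key are hypothetical, and this is only one of several ways to keep credentials out of code.

```python
import os


def load_api_key(env_var="EXAMPLE_API_KEY"):
    """Read a secret from the process environment rather than from source code."""
    # EXAMPLE_API_KEY is a hypothetical name; use whatever the deployment defines.
    key = os.environ.get(env_var)
    if not key:
        # Fail loudly instead of falling back to a value embedded in the code.
        raise RuntimeError(f"{env_var} is not set")
    return key
```

The key should stay on the server side. Sending it to the browser in HTML or JavaScript would expose it again, regardless of how it was loaded.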