Security researchers found that datasets used by companies to develop large language models included API keys, passwords, and other sensitive credentials.
Large language models are dominating the online landscape, with companies promoting AI solutions that claim to solve all problems.
For an AI to be effective, it needs extensive training data, much of which is gathered from the Internet by specialized companies and organizations.
Common Crawl, for example, gathers information from across the internet and provides free datasets that companies use to train their models; that scraped content can include sensitive data.
Researchers from Truffle Security discovered that credentials, API keys, and passwords are exposed in this data. The root cause is that some web developers hardcode sensitive information into the front-end code of websites, which is then scraped into LLM training data.
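As an illustration of how such exposure can be detected, a scanner can apply regular expressions for common credential formats to page source. This is only a minimal sketch, not Truffle Security's actual tooling; the patterns and the sample page below are hypothetical, and real scanners use far more patterns plus live verification of each match.

```python
import re

# Hypothetical patterns resembling common credential formats.
SECRET_PATTERNS = {
    "aws_access_key": re.compile(r"AKIA[0-9A-Z]{16}"),
    "generic_api_key": re.compile(
        r"api[_-]?key['\"]?\s*[:=]\s*['\"][A-Za-z0-9_\-]{20,}['\"]",
        re.IGNORECASE,
    ),
}

def scan_for_secrets(text: str) -> list[tuple[str, str]]:
    """Return (pattern_name, matched_string) pairs found in the text."""
    findings = []
    for name, pattern in SECRET_PATTERNS.items():
        for match in pattern.finditer(text):
            findings.append((name, match.group(0)))
    return findings

# A fabricated snippet of front-end JavaScript with a hardcoded key.
page_source = 'const config = { apiKey: "abcd1234efgh5678ijkl9012" };'
print(scan_for_secrets(page_source))
```

Anything a scanner like this can find in public HTML or JavaScript is equally visible to crawlers that feed LLM training sets.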
The researchers discovered 11,908 live secrets (credentials that still authenticated successfully), such as API keys and passwords, across 2.76 million websites.
“Leaked keys in Common Crawl’s dataset should not reflect poorly on their organization; it’s not their fault developers hardcode keys in front-end HTML and JavaScript on web pages they don’t control. And Common Crawl should not be tasked with redacting secrets; their goal is to provide a free, public dataset based on the public Internet for organizations like Truffle Security to conduct this type of research,” explained the researchers.
Companies that create LLMs have warned against hardcoding sensitive information into websites. They advise developers to avoid the practice, since code containing embedded secrets may be unintentionally shared or reused, spreading the exposure further.