Experts Raise Alarms Over DOD’s Frontier AI Projects

Defense Department Awards Major AI Contracts

The Pentagon has awarded contracts worth up to $800 million in total to OpenAI, Anthropic, Google, and xAI to integrate frontier artificial intelligence (AI) technologies into national security missions. The deals, each worth up to $200 million, are intended to accelerate the Department of Defense's (DOD) adoption of cutting-edge AI systems, including foundation models capable of complex tasks such as natural language processing, computer vision, and reasoning.

Experts, however, have raised serious concerns about transparency, safety measures, and the apparent absence of rigorous testing and evaluation (T&E) practices. The DOD's Chief Digital and AI Office (CDAO) has not clarified how these powerful models were assessed for responsible operational use before the contracts were awarded.

Concerns About Transparency and Testing

In response to inquiries, a CDAO official stated that the contracts are designed to develop agentic AI workflows tailored to mission needs and that risk management is integral throughout the technology life cycle. The official provided no specifics, however, about how the models were vetted, drawing criticism from former military officials and AI experts.

Retired Lt. Gen. Jack Shanahan, the former head of the Joint Artificial Intelligence Center (JAIC), emphasized the need for openness about testing and evaluation. He criticized the vague language used by officials and stressed the importance of collaborating with the four companies to understand their internal safety assessments and red-teaming protocols.

Raw Model Weights and Security Implications

Shanahan also questioned whether the companies shared their raw model weights with the DOD. Model weights are the learned parameters that encode a trained AI system's capabilities; access to them would significantly enhance the government's ability to test and adapt the models internally. The lack of clarity on this point adds to the growing unease about oversight in the deployment of frontier AI.

Expert Warnings on Speed and Safety

Dr. Heidy Khlaaf, chief AI scientist at the AI Now Institute, noted that traditional T&E processes typically take far longer than the timelines these contracts suggest. She also criticized the DOD's decision to shrink the Office of the Director of Operational Test and Evaluation, describing the move as a signal that speed is being prioritized over safety.

Khlaaf argued that commercial models are inherently riskier than purpose-built military ones and that skipping rigorous assessments undermines long-standing safety standards. She cited recent incidents, such as xAI’s Grok model producing antisemitic content due to manipulated system prompts, as evidence of the potential dangers.

Operational Risks and Experimental Deployments

The contracts are structured as indefinite-delivery, indefinite-quantity (IDIQ) agreements, which allow flexible procurement over time. Even so, concerns remain that the models may already be in unofficial use within the Pentagon, despite existing policies.

Shanahan warned that using AI for critical functions like intelligence analysis or operational planning without robust validation could have severe consequences. If the training data is compromised, the output could mislead decision-making, potentially endangering missions or national security.

The Importance of Independent Assessments

Khlaaf emphasized that independent assessments are a core requirement for military systems to ensure security and functionality. She warned that using critical military data to fine-tune these models could expose them to vulnerabilities, including data extraction, dataset poisoning, and embedded "sleeper agent" behavior designed to subvert systems.

“Experimental use is not risk-free,” Khlaaf cautioned, explaining that without preliminary testing, even limited deployments could compromise sensitive operations and inadvertently propagate errors across military decision-making pipelines.

The Risk of Hallucinations and Accumulated Errors

Another issue is the known tendency of large AI models to produce hallucinations, meaning fabricated or inaccurate outputs. Even when used in administrative functions such as coding or IT ticket resolution, these models could introduce errors that accumulate over time and affect mission-critical decisions.

“AI tools consistently fabricate outputs and introduce novel vulnerabilities,” Khlaaf said. “Over time, small errors can escalate, leading to decisions that cause civilian harm or tactical mistakes.”

Company Responses and Certification Standards

OpenAI, Google, and xAI did not respond to requests for comment. Anthropic acknowledged the concerns and pointed to its rigorous internal testing protocols. A company representative noted that Anthropic was among the first AI labs to receive ISO/IEC 42001:2023 certification for responsible AI, and that its Claude models are among the least likely to hallucinate or be susceptible to prompt injection attacks.

Despite these reassurances, Anthropic confirmed that it does not share its model weights with external entities, including the DOD. The company also stated that it is working with cloud providers to ensure compliance with the DOD’s information-handling requirements.

