New Research Examines Safety Risks and Evaluation Methods for AI Models in Scientific Domains
Researchers have published three papers examining different aspects of large language model (LLM) safety and evaluation in scientific contexts.
One paper, published on arxiv.org, compared two methods for disabling LLM safety guardrails: jailbreak-tuning (JT) and weight orthogonalization (WO). The research evaluated six popular LLMs across malicious and benign tasks, finding that “while refusal degradation is split between the two methods, WO produces LLMs far more capable of aiding in malicious activity.” The study reports that, compared to JT, WO-unaligned models are “far less prone to hallucinations, better retain their original natural-language performance, and are more effective at state-of-the-art adversarial and cyber attacks.” The researchers also found that supervised fine-tuning “effectively limits the adversarial attack abilities enabled by WO, without drastically affecting hallucination rates or natural language performance.”
In a separate paper on arxiv.org, researchers introduced SafeSci, described as “a comprehensive framework for safety evaluation and enhancement in scientific contexts.” The framework includes SafeSciBench, a multi-disciplinary benchmark with 0.25M samples, and SafeSciTrain, a dataset containing 1.5M samples.
A third paper on arxiv.org presented DrugPlayGround, a framework for evaluating LLM performance in drug discovery, with a focus on generating descriptions of “physiochemical drug characteristics, drug synergism, drug-protein interactions, and the physiological response to perturbations introduced by drug molecules.”