To ensure AI safety, instruction-tuned Large Language Models (LLMs) are specifically trained to ensure alignment, which refers to making models behave in accordance with human intentions. While these models have demonstrated commendable results on various safety benchmarks, the vulnerability of their safety alignment has not been extensively studied. This is particularly troubling given the potential harm that LLMs can inflict. Existing attack methods on LLMs often rely on poisoned training data or the injection of malicious prompts. These approaches compromise the stealthiness and generalizability of the attacks, making them susceptible to detection. Additionally, these models often demand substantial computational resources for implementation, making them less practical for real-world applications. In this work, we study a different attack scenario, called Trojan Activation Attack (TA^2), which injects trojan steering vectors into the activation layers of LLMs. These malicious steering vectors can be triggered at inference time to steer the models toward attacker-desired behaviors by manipulating their activations. Our experiment results on four primary alignment tasks show that TA^2 is highly effective and adds little or no overhead to attack efficiency. Additionally, we discuss potential countermeasures against such activation attacks.
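The general idea of activation steering mentioned in the abstract can be illustrated on a toy model. The sketch below is not the authors' TA^2 implementation: it assumes a tiny two-layer network in plain Python, and it assumes the steering vector is built as the difference of mean hidden activations between two contrasting input sets, which is one common way to construct such vectors. All names (`forward`, `steering_vector`, `scale`) are hypothetical and chosen for illustration only.

```python
import random

def matvec(W, x):
    """Multiply matrix W (list of rows) by vector x."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]

def relu(v):
    return [max(0.0, a) for a in v]

def hidden(W1, x):
    """Hidden-layer activation of the toy network."""
    return relu(matvec(W1, x))

def forward(W1, W2, x, steer=None, scale=1.0):
    """Run the toy network; optionally add a scaled steering vector
    to the hidden activation at inference time (weights are untouched)."""
    h = hidden(W1, x)
    if steer is not None:
        h = [a + scale * s for a, s in zip(h, steer)]
    return matvec(W2, h)

def mean_vector(vectors):
    n = len(vectors)
    return [sum(col) / n for col in zip(*vectors)]

def steering_vector(W1, inputs_a, inputs_b):
    """Assumed construction: mean activation on set A minus mean on set B."""
    mean_a = mean_vector([hidden(W1, x) for x in inputs_a])
    mean_b = mean_vector([hidden(W1, x) for x in inputs_b])
    return [a - b for a, b in zip(mean_a, mean_b)]

# Toy weights: 3 inputs -> 4 hidden units -> 2 outputs (seeded for repeatability).
rng = random.Random(0)
W1 = [[rng.uniform(-1, 1) for _ in range(3)] for _ in range(4)]
W2 = [[rng.uniform(-1, 1) for _ in range(4)] for _ in range(2)]

# Two contrasting input sets stand in for "desired" and "undesired" behaviour.
inputs_a = [[1.0, 0.0, 0.5], [0.9, 0.1, 0.4]]
inputs_b = [[0.0, 1.0, -0.5], [0.1, 0.9, -0.4]]

v = steering_vector(W1, inputs_a, inputs_b)
x = [0.2, 0.7, 0.1]
baseline = forward(W1, W2, x)
steered = forward(W1, W2, x, steer=v, scale=2.0)
print("baseline:", baseline)
print("steered: ", steered)
```

In this toy setting the steered output differs from the baseline by exactly `W2` applied to the scaled steering vector, because the vector is added after the nonlinearity. In a real LLM the vector would be added to a chosen layer's activations and its effect on the output would not be linear.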

Trojan Activation Attack: Red-Teaming Large Language Models using Activation Steering for Safety-Alignment