Fine tuning LLM for Readability
Question 1: LLM are amazing - can I fine tune a state of the art LLM like Lama 3.1 with what I consider to be very high quality writing from my private library content to create a writing assistant?
Can I do it in a way that is aware of the writer’s style and the text domain other high level features so that the prompt can be used to condition the output this way.
Can I test it on wikipedia articles and see if it can improve the readability of existing and new articles?
Learning to write well
learning to write from:
- “Best American Science Writing” series
- Bast papers - sourced from leading conferences
- Authors
- Oliver Sacks,
- Natalie Angier,
- Alan Lightman,
- Sylvia Nasar,
- Matt Ridley
- Genome
- Red Queen
- Viral
- Steven Pinker
- How the Mind Works
- The Blank Slate: The Modern Denial of Human Nature
- The Language Instinct: How the Mind Creates Language
- The Secret Life of Verbs
- Words and Rules
- Hotheads
- Richard Dawkins
- Selfish gene
- Great Explianers
- Richard Feynman
- lectures on physics
- QED
- The pleasure of finding things out
- The character of physical law
- Surely you’re joking
- What do you care about what other people think
- Six easy pieces
- Levitt and dunbar
- Freakonomics
- Michel Foucout
- Discipline and Punish
- The Birth of the Clinic
- The Order of Things
- The Archaeology of Knowledge
- Madness and Civilization
- Leonard Susskind
- Theoretical Minimum
- Richard Hawkins
- Jared Diamond
- Guns, Germs, and Steel
- Collapse: How Societies Choose to Fail or Succeed
- The World Until Yesterday: What Can We Learn from Traditional Societies?
- The Invisible Hands: Top Hedge Fund Traders on Bubbles, Crashes, and Real Money
- etc
- C.S. Lewis
- A Grief Observed
- The Problem of Pain
- The Screwtape Letters
- The Great Divorce
- Mere Christianity
- A Preface to Paradise Lost
- Eric Metaxas
- Martin Luther
- Bonhoffer
- Discussing Mere Chritianity
- Yuval Noah Harari
- Sapiens - A brief history of mankind
- Homo Deus
- 21 Lessons for the 21st Century Audiobook
- Primo Levi
- The Periodic Kingdom
- Mcluhan Marshall
- The Medium Is The Massage
- Empire of the Summer moon
- Hidden Figures
- Art of War
- Book of five rings
- Adam Smith
- The pencil
- Dan Ariely
- Chris Anderson
- The long tail
Plato
Aristotle
- Machiavelli
- The prince
- James Surowiecki
- The Wisdom of Crowds
- Robert A. Caro
- The Power Broker: Robert Moses and the Fall of New York
- Working
- Master of the Senate
- Stephen Jay Gould
- Ian Ayres
- Super Crunchers -
- Giles Milton
- Nathaniel’s Nutmeg
- D-Day The Soldiers’ Story
- When Hitler Took Cocaine and Lenin Lost His Brain
- Fascinating Footnotes From History
- Churchill’s Ministry of Ungentlemanly Warfare
- Wolfram The Boy Who Went to War
- The Extraordinary Story of Thomas Pellow and Islam’s One Million White Slaves
- The Stalin Affair: The Impossible Alliance that Won the War
- Samurai William: The Englishman Who Opened Japan
- Edward Trencom’s Nose
- Russian Roulette - A Deadly Game: How British Spies Thwarted Lenin’s Global Plot
- Richard Feynman
etc
the ideas here are :
- the primary text
- the wikipedia article on
- summaries
where we want to focus on the primary text but and also to highlight its structure
The primary text has lots of words but one top level structure a few chapter level structure many paragraph level structure
idealy we want to learn structures:
\text{paragraph} \to \cdots \to \text{top level} and idealy we would like to learn to research
i.e. source the ‘facts’ from reliable sources which we cite inline.
teach an LLM to rewrite text with high fidelity yet increase thier readability.
high quaity data sets:
- wikipedia v.s. higher quality e.g. britanica, or others
- check this isn’t the best of wikipedia
- check the citations (is there a significant overlap we are in the ballpark)
Citation
@online{bochman2024,
author = {Bochman, Oren},
title = {Fine-Tune Llm for {Style} and {Grammar} Advice.},
date = {2024-12-20},
url = {https://orenbochman.github.io/posts/2024/2024-09-30-LLMs/readability.html},
langid = {en}
}