Sunday, 17 August 2025

The AI vs authors results! (part 2)


Before you look at the results make sure to do the test!


So, before I get to the numbers and reveals I'll reiterate a few things:

i)  I hate that AI can do this.

ii)  These authors write books - typically long series - so flash fiction is not their forte.

iii)  Flash fiction is where AI does best - it starts to fall apart as the required work gets longer.


Some have questioned "why flash fiction"...

Answer: because you test things at breaking point. If I'm interested in what it takes to smash a window I don't throw 8 anvils at 8 windows and 8 ping-pong balls at 8 windows and say, "Welp, there you have it, anvils are 100% better than ping-pong balls." 

In the first blog post 2 years ago the performance of the authors and AI overlapped but the authors did better on the whole. So my guess was correct: this is where the AI performance starts to falter.

In the second blog post we revisit to see what has changed with 2 years' development. I found the results interesting.

If I could have got a meaningful number of people to read 8 twenty-thousand-word novellas (I couldn't) and I could convince busy authors to write novellas for an experiment (I couldn't) then we would clearly get a 100% result in favour of the humans ... and ... have learned very little about the state of play.


The contributing authors have sold around 15 million books between them. And they are...


Robin Hobb

Janny Wurts

Christian Cameron / Miles Cameron

& me!


In terms of ratings - when we did this 2 years ago, the scores were low. Six of the eight entries scored 3* or less. The people I can attract to vote in this like to read books. Short stories are unpopular in comparison, and the shorter they get the harder it is to tell a good tale. The ungenerous will suggest it is simply because none of the entries were great - I feel it's because book readers rate short stories lower than books.

Two years later, five of the eight entries scored 3* or above. You can consider the results below to see if that was because the humans did better, or the AI did better, or both.


I have a Ph.D student on my patreon who constantly berates me for my terrible diagrams. Here's another one, just for you, Rae!

We had 964 votes on the issue of whether story 1 was by a human or AI. This fell fairly smoothly to 474 votes on the rating of story 8. 

So, when it came to choosing, on average the public got 3 wrong, 3 right, and couldn't decide on 2. I.e. they're no more effective than a coin toss!

Two of these were too close to be statistically significant, but in some cases the votes were quite certain. A sizeable majority of people thought my story was human authored and a sizeable majority thought Janny's story was AI authored. So it's not that people don't have strong opinions/instincts ... it's just that they're no more likely to be correct than tossing a coin.

I asked (a new session of) ChatGPT to guess which ones were AI and it didn't do a good job either, despite generating them.


And the scores on the doors?



And here the bad news is that the AI scored better than us. Not only was the highest rated story an AI one, but they scored higher on average too.


I asked the authors to do the test themselves. Only one has got back to me at time of posting.
That author made five guesses, four of which were wrong, and listed as their top two stories ... two AI generated ones...


Conclusion

First off, let me repeat my disclaimer about this not being a scientifically rigorous test.

Given that:

On the short scale it seems likely that people, on average, can't tell AI from human when it comes to fantasy writing.

If you got 6 right out of 8 ... well there's a ~15% chance of getting that result (or better) by chance, so rather than 15% of us patting ourselves on the back, we really have to look to the bulk statistics for answers. And they don't look good.
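
For anyone who wants to check that ~15% figure, here is a rough sketch of the sum. It assumes each of the 8 guesses is an independent 50/50 coin toss, which is my simplification and not how real readers behave. The function name prob_at_least is made up for illustration.

```python
from math import comb

def prob_at_least(k, n=8, p=0.5):
    """Chance of k or more correct guesses out of n.

    Assumes every guess is an independent coin toss with
    success probability p (a simplifying assumption).
    """
    return sum(comb(n, i) * p**i * (1 - p)**(n - i) for i in range(k, n + 1))

# 6, 7 or 8 right out of 8: (28 + 8 + 1) / 256
print(f"{prob_at_least(6):.3f}")  # prints 0.145
```

So roughly one pure guesser in seven would land on 6/8 or better without any real ability to tell the difference.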

In terms of enjoyment ... in this test the AI won.

Can AI generate a better book than Robin Hobb can write? Absolutely not. Might it one day generate a book that would do better than one of hers in terms of sales and public acclaim? A few years ago I would have said 'absolutely not', at least in my lifetime. Now, it seems like a possibility, though hopefully an unlikely one (again - in my lifetime).

Should AI generate fiction, imagery, voices etc., competing with artists in a number of fields and fooling the public? No, of course not. I hate that idea and most people do too.

Will it happen? It's already happening. Wherever anyone can circumvent skill and heart and just profiteer off a new technology, they're going to do it. People threaten people with knives in the street for a few dollars - are people going to try to sell you AI books ... of course.

I want AI to cure diseases. That's mostly it. But it looks like it was one of the belated escapees of Pandora's Box, and we're not going to be able to put it back.

Will I ever use AI to write anything (other than the bits of flash fiction in these tests)? No.

Will I ever read AI fiction for pleasure? No. To quote someone wise: If nobody could be bothered to write this, why should I bother to read it?

It's a pretty grim outlook though, especially for new and future authors.


I had always felt that to write a great book that looked at human issues and offered insights, emotion, and enjoyment, would require an actual human, and that we wouldn't reach the point where a computer could do it any time soon.

I now wonder, if (and it's still a significant if) we get there ... will that mean that the AI is intelligent, alive in some sense, worthy of respect and rights? Will we have created an intelligent lifeform in lieu of going off into space and finding one? And is that a wise and/or moral thing to do?

It's a huge shock to me that fiction which, in this test, scores higher than great authors who write wonderful stories full of soul and heart and wit and intelligence, can be generated by the multiplication of a relatively small number of not particularly large matrices. On the face of it it undercuts so many things we value about being human.


There are many ways to argue against being too disheartened by this sort of thing. I advise you to seek them out. The future feels like a scary place right now, but I hope that, as far as the creative arts are concerned, AI runs up against a wall very soon and efforts are directed into doing tasks that benefit humanity rather than undermine it.













16 comments:

  1. How do I read the story rating graph? I don't know how to tell which of the green circles refer to stories 1, 2, 6 and 8.
    I'm happy that 2 of my 3 favorites, stories 1 and 5, were written by human authors that I can read more stories by, and disappointed my other favorite, story 5, was not. Would you share the prompt you used to generate it? I liked its atmosphere a lot.

    I'm feeling rather smug at having identified all the stories correctly except for 1. I based my guesses on the assumption that human-authored stories would have a point to them, and ai-authored stories would not, but I abandoned this strategy for 1 because I was fooled by the style of writing.
    I also wrote down my final guesses after reading all the stories, but voted as I read, and my votes were different from my final guesses. That may have negatively affected the results if others did the same.

    Replies
    1. I decided that it wasn't fair to invite these authors to take part and then to potentially embarrass some of them with lower scores than the AI, so I took the executive decision to let all of them potentially have the 2nd place story.

    2. (they'd all implicitly signed up to have scores associated - none of them asked for it not to happen - I just didn't feel right about it)

    3. Also ... story 5 was not written by a human :D

    4. The prompt for story 5 was something like, "write a 350 word fantasy story in a modern setting and using a more literary style".

    5. Oh, and I would have had "to the prompt "A Demon"" in there too.

    6. How did your own story do, if you are willing to reveal that?

  2. As I suspected, I didn't spot most of them. (For a benchmark, I've written over a thousand pieces of flash fiction.) Hey ho, guess I'll stick to just writing the stories. Good quiz. Ta.

  3. I'm quite pleased to say that I got 6/8 correct. It probably is not statistically significant, though.

    Replies
    1. It isn't - I even explicitly discuss the 6/8 case in the post :D

  4. I am shocked that I couldn’t spot all the AI stories…. I guess that’s a testament to where technology has improved dramatically over the years. I feel that I lost an epic battle, but still have a sword in my hand. Very interesting Mark… thank you!

  5. I was straight down the middle, half and half, however I do wonder if there is another bias to be considered here. The two Human stories I put as AI had elements that felt off, and I wonder if there was a weird sort of circular logic going in the writing with authors trying to write in a more clever or obtuse way with language choice than they might normally use, as they are trying to compete against a "dumb machine". But the problem is ironically this can then be potentially blamed for creating the false impression it was AI. This is particularly the case I feel on story 1 - I feel bad calling out an individual here, but might be worth asking Janny about that - it's the final sentence in particular that just comes across as not something a human would write, like a computer concluding its mini story. And the thing is, once you have one wrong you should invariably get another one wrong because you know it's 4/4...

    Replies
    1. Even if you know it's 4 humans vs 4 AI pieces - which I didn't say - going for 4 of each is not the optimal strategy if you have confidence in your ability to judge and 5 strike you as human-written.

  6. I should be happy to say that I'm another of the 15% that got 6/8 correct, but I'm less happy to say that it was difficult. Some of the fiction was obviously worse than others, but I didn't want to say that was AI out of my own hope it was the truth, so I tried to judge whether it was worse in a generalized LLM way or worse in a "this writer is uncomfortable with writing short fiction" kind of way.

    And the latter point is something I wish this experiment could take into account (you did already address this in the first post). The deck is really stacked in favor of AI to excel here since most writers aren't comfortable with writing short fiction and I think that was obvious here. The story I thought was the worst written was the one that I really wanted to say was AI because of my opinion on the quality of the writing. But overall it did seem like a very *human* poor quality of writing. And it turns out I'm more interested in reading poor quality human writing than I am perfectly serviceable average quality AI writing.

  7. Very fun experiment! I wonder if turning up the model temperature would make this even more difficult as, to me, an adherence to 'safe' (comparatively) language seems like one of the stronger AI indicators. Story 2, in contrast, is so ridiculous that it almost had to be written by a human or queried with a prompt containing extra instructions.

    I guess the typical AI indicators like problems with long-distance dependencies and repetition become more apparent when the stories get longer.

  8. The only story that I was certain on was 6, and I also liked it the best. So congratulations, you wrote my favorite :)
