Summarizing UrbanToronto threads with LLMs

The forum format of communication has enough merits and perils to spend a lifetime debating, but that’s for another day. Given the general trend of rich, high-return information on the internet falling by the wayside, the continued existence of the UrbanToronto forum is a minor miracle.

That said, cramming so much transit discussion into a few threads has some pitfalls:

  1. Discussion often goes in circles, with many reply chains rehashing things that were already discussed
  2. It’s hard for newcomers to know where to start reading
  3. If you’ve been away from a thread, it’s not obvious how far back you must go to be caught up on whatever the latest development is

I can’t do anything about the structure of the forum (and I’m not sure changing it would even be desirable), but I think a general index or summary of the 1000+ page threads would be useful. The end product would be two lists: topics with links to the pages where they’re discussed, and, for each page or range of pages, the topics discussed there (a range being used where one topic spans multiple pages).
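To make the shape of that index concrete, here is a minimal sketch of how per-page topic lists could be inverted into a topic-to-page-range index. The function name `build_topic_index` and the topic labels in the usage example are made up for illustration; the per-page topics would come from whatever the model produces.

```python
from collections import defaultdict


def build_topic_index(page_topics: dict[int, list[str]]) -> dict[str, list[tuple[int, int]]]:
    """Invert {page: [topics]} into {topic: [(first_page, last_page), ...]}.

    Consecutive pages that mention the same topic are merged into one range.
    """
    pages_by_topic: dict[str, list[int]] = defaultdict(list)
    for page in sorted(page_topics):
        for topic in page_topics[page]:
            pages_by_topic[topic].append(page)

    index: dict[str, list[tuple[int, int]]] = {}
    for topic, pages in pages_by_topic.items():
        ranges: list[tuple[int, int]] = []
        for page in pages:
            if ranges and page == ranges[-1][1] + 1:
                # Topic continues onto the next page: extend the current range.
                ranges[-1] = (ranges[-1][0], page)
            elif not ranges or page != ranges[-1][1]:
                # Gap in the discussion (or first mention): start a new range.
                ranges.append((page, page))
            # Otherwise the topic was listed twice on the same page; skip it.
        index[topic] = ranges
    return index


# Hypothetical topics, for illustration only.
example = {
    1: ["fare integration"],
    2: ["fare integration", "signalling"],
    3: ["fare integration"],
    7: ["signalling"],
}
index = build_topic_index(example)
```

The per-page view is just the input dictionary, so both lists described above come out of the same data.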

Here’s a rough cost calculation for some transit-related threads using GPT-4. Note that the tokens/page figure is an estimate; it accounts for the construction thread having many photos on each page, which reduces the amount of text. I’m also only counting input tokens, since the cost of output will be negligible in comparison.

And the same for GPT-3.5, which should probably be good enough for summarizing.
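The arithmetic behind an input-only estimate like this is simply pages × tokens per page × price per token. A minimal sketch follows; the function name is hypothetical, and the numbers in the usage line are placeholders, not figures for any actual thread or quoted API rates.

```python
def estimate_input_cost(pages: int, tokens_per_page: int, usd_per_1k_tokens: float) -> float:
    """Estimated cost in USD of sending every page to the model once.

    Only input tokens are counted; output is assumed to be negligible.
    """
    total_tokens = pages * tokens_per_page
    return total_tokens / 1000 * usd_per_1k_tokens


# Placeholder values: substitute the real page count, a measured
# tokens/page estimate, and the current price for the chosen model.
cost = estimate_input_cost(pages=1000, tokens_per_page=4000, usd_per_1k_tokens=0.03)
print(f"Estimated input cost: ${cost:.2f}")
```

A photo-heavy thread would get a lower `tokens_per_page` than a text-heavy one, which is the only per-thread adjustment the estimate needs.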

The only concern with 3.5 is the shorter context window, but I think that if I use the summary it creates for page n as context for page n+1, I can squeeze in enough for the summaries to be okay, albeit with some loss.
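That rolling-context idea could look roughly like the sketch below. The `summarize` argument is a stand-in for whatever model call ends up being used (no particular API is assumed), the prompt wording is a placeholder, and trimming the previous summary by characters is a crude substitute for real token counting.

```python
from typing import Callable


def summarize_thread(
    pages: list[str],
    summarize: Callable[[str], str],
    max_context_chars: int = 2000,
) -> list[str]:
    """Summarize pages in order, feeding each page's summary into the next prompt."""
    summaries: list[str] = []
    previous = ""
    for text in pages:
        # Keep only the tail of the previous summary so the prompt stays small.
        context = previous[-max_context_chars:]
        prompt = (
            "Summary of the previous page:\n" + context + "\n\n"
            "Current page:\n" + text + "\n\n"
            "List the topics discussed on the current page."
        )
        previous = summarize(prompt)
        summaries.append(previous)
    return summaries
```

Because each prompt only sees the previous page's summary, anything the model drops on page n is gone for good by page n+1, which is the "some loss" mentioned above.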

I’ll do a test run on a few pages soon and work on the prompting to get decent output, but only after I give the terms of service a read and/or ask someone from UT whether it would be permitted. Overall I think it would enhance readership of the forum, but of course I will play by the rules.