Extracting JSON Code with Nested Curly Braces in Ruby (the long painful way around, with help from GPT4)

Given a text string that contains JSON code with possible nested curly braces, I needed to extract the outermost JSON code, including the curly braces. Here's an example of such text, which you may recognize as the output of an LLM (specifically GPT in this case):

Here's the JSON you requested:

{
 "title": "Brainstorming ideas",
 "summary": "The user discussed exporting basic profile bots",
 "sentiment": "positive",
 "language": "English",
 "additional_information": {
    "tags": ["brainstorming", "bots", "automation"]
  }
 }

An initial crack at extracting just the JSON with a regex might look like this, but the outermost selector is too greedy.

> text.match /.*(\{.*\})/m; $1
=> "{\n    \"tags\": [\"brainstorming\", \"bots\", \"automation\"]\n  }\n }"

I took a hard look at that regex and figured that I just needed to make the first .* less greedy for it to work fine, like this:

/.*?(\{.*\})/m

And here is where the story gets more interesting (or stupid, depending on your perspective.) Because I tried it and unfortunately it DID NOT work. And since it was late at night and I was tired, instead of trying to figure out why I got frustrated and asked GPT4 to help me out.

GPT offered me this regex instead: /(?<=\{)(?s:.*)(?=\})/m

I've occasionally had to use PERL over the years so I was at least passingly familiar with the syntax here. It uses a positive lookbehind (?<=\{) to assert that the match starts after an opening curly brace, and a positive lookahead (?=\}) to assert that the match ends before a closing curly brace. The (?s:.*) part matches any characters, including newlines, between the curly braces, and finally, the /m flag is used for multi-line matching. Which should have worked fine, except that I'm coding in Ruby, so it generated this exception: undefined group option

It turns out that Ruby's regex engine does not support these operations (or if it does, I was not able to figure it out).

When I pointed this out to GPT4 it proposed a more straightforward and brute-force approach by writing a Ruby method that loops through the text and counts the opening and closing curly braces. This method can efficiently find the starting and ending indices of the outermost JSON object.

Here's the Ruby code to extract the JSON code from the text:

def extract_json(text)
  start_index = text.index('{')
  return nil if start_index.nil?

  end_index = start_index
  brace_count = 1

  text[start_index + 1..-1].each_char.with_index do |char, index|
    brace_count += 1 if char == '{'
    brace_count -= 1 if char == '}'
    end_index += 1

    break if brace_count.zero?
  end

  text[start_index..end_index]
end

This code defines a method called extract_json that takes the input text as an argument. It then loops through the text to find the starting and ending indices of the outermost JSON object by counting the opening and closing curly braces. Once the outermost JSON object is found, it is returned as a string.

It passed my tests so I committed the code and called it a night. But that first failed regex solution still bothered me this morning, and eventually I opened a terminal and started playing around with it again. Making that first .* operator less greedy should work.

And guess what? It does work.

> text
=> "Here is your JSON:\n\n```\n{\n \"title\": \"Brainstorming ideas\",\n \"summary\": \"The user discussed exporting basic profile bots\",\n \"sentiment\": \"positive\",\n \"language\": \"English\",\n \"additional_information\": {\n    \"tags\": [\"brainstorming\", \"bots\", \"automation\"]\n  }\n }\n```\n"

> text.match /.*?(\{.*\})/m; $1
=> "{\n \"title\": \"Brainstorming ideas\",\n \"summary\": \"The user discussed exporting basic profile bots\",\n \"sentiment\": \"positive\",\n \"language\": \"English\",\n \"additional_information\": {\n    \"tags\": [\"brainstorming\", \"bots\", \"automation\"]\n  }\n }"

Why I couldn't get it to work yesterday? No idea. But I did learn a lesson about trusting GPT4 over my own instinct. I'm going to call it a double-edged sword. You got to be careful you don't cut yourself. This particular solution wasn't that big a deal, but I can imagine situations where it would be.

Interestingly enough, I pointed out the failure to GPT4 and it admitted that it made a mistake.

You are correct, and I apologize for my earlier explanation. The regex /.*?(\{.*\})/m indeed works for extracting the outermost JSON code, including when there are nested curly braces, as demonstrated in your example. The non-greedy .*? before the capturing group ensures that the regex starts matching at the first opening curly brace. The greedy .* inside the capturing group then matches all characters, including nested curly braces, until the last closing curly brace is found.

Thank you for pointing this out, and I apologize for the confusion in my earlier responses.