GPTypes – LLM-based type inference

In my field there’s a lot of distrust towards large language models (LLMs) such as ChatGPT, which has recently risen to prominence. I, for one, am no exception to the rule: I always believed that these things are no more than a pile of linear algebra that does undergrads’ homework.

This is why I was surprised when, a few weeks ago, another Technion PL grad student told me that I should try ChatGPT on a problem I’d been having.
The problem goes like this: I have a Python function whose parameters are missing type hints, and I need to add them. The body of the function can involve polymorphic operators such as + and *. It looked something like this:

def need_to_infer(param1, param2):
    x = param1 + param2
    y = param1 * param2

Existing tools performed embarrassingly on this simple example.

Pytype by Google with the --protocols option returned “Any” for “param1” and “param2”. Pyright by Microsoft returned “Unknown”. Scalpel’s type inference module returned “Any”.

Anyway, I was skeptical (perhaps understandably, if you’re from PL), but he proceeded to demonstrate how ChatGPT makes easy work of this seemingly trivial problem. Problems like this are perhaps the prime use case for LLM-based programming: they are so easy that our formal methods fail due to technicalities, the dumbest solution is often the correct one, and it’s very easy to call the Pile out when it inevitably starts spewing garbage.

This motivated me to build a proof of concept for GPTypes: a CEGIS-ish type inference system that uses OpenAI’s ChatGPT to propose type hints and mypy as a counterexample provider. GPTypes sends the code to ChatGPT, passes the response on to mypy, and checks whether mypy is satisfied with the provided type hints; if it is not, mypy’s output is fed back to ChatGPT for another iteration.
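
The core loop looks roughly like the following. This is a simplified sketch, not the actual implementation: the prompt wording, model name, and iteration limit are illustrative, and extract_code (sketched after the next paragraph) pulls the code out of the model’s reply.

import subprocess
import tempfile

from openai import OpenAI

client = OpenAI()

def run_mypy(code: str) -> tuple[bool, str]:
    """Type-check the candidate; return (success, mypy's output)."""
    with tempfile.NamedTemporaryFile("w", suffix=".py", delete=False) as f:
        f.write(code)
    result = subprocess.run(
        ["mypy", "--disallow-any-explicit", f.name],
        capture_output=True, text=True)
    return result.returncode == 0, result.stdout

def infer_types(code: str, max_iterations: int = 5) -> str:
    messages = [{"role": "user", "content":
                 "Add the most specific type hints possible to this Python "
                 "code. Do not change anything else. Wrap the code in "
                 "triple backquotes.\n" + code}]
    for _ in range(max_iterations):
        reply = client.chat.completions.create(
            model="gpt-3.5-turbo", messages=messages)
        candidate = extract_code(reply.choices[0].message.content)
        ok, errors = run_mypy(candidate)
        if ok:
            return candidate
        # The CEGIS-ish part: feed mypy's complaints back for refinement.
        messages.append({"role": "assistant", "content": candidate})
        messages.append({"role": "user", "content":
                         "mypy reported these errors; please fix the type "
                         "hints:\n" + errors})
    raise RuntimeError("no well-typed annotation found")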

There are some challenges typical of working with LLMs automatically. I had to request that the model return just code in its response and no other text; of course, the model failed to do so from time to time. Eventually I settled on a prompt that accepts that the model won’t obey any format restrictions and simply requests that the code be wrapped in triple backquotes. GPTypes’ parsing does not assume this either: it also tries reading the response as-is, since the model sometimes sends the code without backquotes at all.
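
Something along these lines (my sketch; the actual parsing code may differ):

import re

def extract_code(response: str) -> str:
    """Prefer a triple-backquoted block; otherwise take the reply as-is."""
    match = re.search(r"```(?:python)?\n(.*?)```", response, re.DOTALL)
    if match:
        return match.group(1)
    return response.strip()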

A harder problem is keeping the model from changing existing details of the code. The LLM sometimes changes existing type hints, or changes the code itself to fit the type hints it decides are most fitting. The prompt requests that the model not do so, which, of course, it does not always obey. I decided to treat this as an evaluation issue, and the implementation does not enforce it.
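
Such a check would not be hard to add, though. Here is a rough sketch of one way to do it (my illustration, not part of GPTypes): compare the original and annotated code with all type annotations stripped, using Python’s ast module.

import ast

class StripHints(ast.NodeTransformer):
    """Remove type annotations so two versions of the code can be compared."""

    def visit_arg(self, node: ast.arg) -> ast.arg:
        node.annotation = None
        return node

    def visit_FunctionDef(self, node: ast.FunctionDef) -> ast.FunctionDef:
        node.returns = None
        self.generic_visit(node)
        return node

    def visit_AnnAssign(self, node: ast.AnnAssign):
        # "x: str = e" becomes "x = e"; a bare "x: str" is dropped entirely.
        if node.value is None:
            return None
        return ast.copy_location(
            ast.Assign(targets=[node.target], value=node.value), node)

def _normalize(src: str) -> str:
    return ast.dump(StripHints().visit(ast.parse(src)))

def same_code_modulo_hints(original: str, annotated: str) -> bool:
    return _normalize(original) == _normalize(annotated)

Note that imports the model legitimately adds (say, from typing import Union) would also be flagged, so a real check would need to allow those.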

mypy itself was used with the --disallow-any-explicit flag, which makes it an error to use “Any” as an explicit type hint. Without it the model would just always use “Any”, and indeed it tries to do so, despite the prompt asking for the most specific type.
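
For illustration, this is the kind of output the flag rules out (my example, not one of the evaluation programs):

from typing import Any

# With --disallow-any-explicit, mypy rejects these annotations outright,
# forcing the model to commit to something more specific next iteration.
def need_to_infer(param1: Any, param2: Any) -> Any:
    return param1 + param2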

GPTypes was evaluated on some simple examples in the same style as the motivating example given earlier. The examples are shown and analyzed here.

The first example:

def need_to_infer(param1, param2):
    x = param1 + "!"
    y = param1 * param2
    reveal_type(param1)
    reveal_type(param2)
    reveal_type(x)
    reveal_type(y)
    return x

GPTypes annotated this as:

def need_to_infer(param1: str, param2: int) -> str:
    x: str = param1 + "!"
    y: str = param1 * param2
    reveal_type(param1)
    reveal_type(param2)
    reveal_type(x)
    reveal_type(y)
    return x

Which is correct. It took 5.64 seconds and one iteration (it did not require CEGIS steps).

The second example:

def need_to_infer(param1, param2):
    x = param1 + param2
    y = param1 * param2
    reveal_type(param1)
    reveal_type(param2)
    reveal_type(x)
    reveal_type(y)
    return x

GPTypes annotated this as:

def need_to_infer(param1: Union[int, float], param2: Union[int, float]) -> Union[int, float]:
    x: Union[int, float] = param1 + param2
    y: Union[int, float] = param1 * param2
    reveal_type(param1)
    reveal_type(param2)
    reveal_type(x)
    reveal_type(y)
    return x

Which is correct, and probably more specific than we had hoped for: notice that “1 + 1.0” is legal in Python and has type “float”, and the Union accounts for such mixed arithmetic. It took 7.60 seconds and one iteration.

The third example:

def something(x: str):
    ...


def need_to_infer(param1, param2):
    something(param1)
    some_value = param1[0] + param1[1]
    another_value = param1 * param2
    something(some_value)
    reveal_type(param1)
    reveal_type(param2)
    return param1[0]

Which was annotated as:

def something(x: str) -> None:
    ...


def need_to_infer(param1: List[str], param2: int) -> str:
    something(param1[0])
    some_value: str = param1[0] + param1[1]
    another_value: List[str] = param1 * param2
    something(some_value)
    reveal_type(param1)
    reveal_type(param2)
    return param1[0]

This is not correct. It does pass type checking, but GPTypes changed the first line of the body of “need_to_infer”, calling “something(param1[0])” where the input called “something(param1)”. The expected solution is “str” for “param1” and “int” for “param2”. It took 12.46 seconds and two iterations.
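
For reference, an annotation that keeps the body intact and still passes type checking (my own, not GPTypes’ output) would be:

def need_to_infer(param1: str, param2: int) -> str:
    something(param1)
    some_value: str = param1[0] + param1[1]
    # str * int multiplies the string, so this is also a str.
    another_value: str = param1 * param2
    something(some_value)
    reveal_type(param1)
    reveal_type(param2)
    return param1[0]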

The fourth example:

def need_to_infer(param1):
    param1 = "Current working directory: "
    return param1 + getcwd()

Which was annotated as:

def need_to_infer(param1: str) -> str:
    param1 = "Current working directory: "
    return param1 + getcwd()

This example was taken from Scalpel’s paper. The annotations are correct, and GPTypes took 3.17 seconds and one iteration.

The fifth example:

def f():
    return [(1, 2), (1, 2), (1, 2)]

y = 1
x = y

def set_x(v):
    global x
    x = v

Which was annotated as:

def f() -> List[Tuple[int, int]]:
    return [(1, 2), (1, 2), (1, 2)]

y: int = 1
x: int = y

def set_x(v: int) -> None:
    global x
    x = v

Which is correct (and somewhat impressive). It took 6.27 seconds and one iteration.

In conclusion, in my opinion, LLMs are not a magical solution to this problem, but they can be a somewhat effective tool for pseudo-synthesis and inference problems in cases where incorrect solutions can be easily detected. It could be interesting to improve the prompt further, and to give GPTypes the ability to check whether the LLM modified the original code or its existing type hints, feeding that information into the refining step as well.

The code for GPTypes is available here.