
Cursor's Reflections on Key Issues in the AI Programming Domain

A while ago, Mr. Yuan sent me two articles from the Cursor team that Yusen's boss had shared in the AI Code group. I didn't have time to read them until today, when I finally sat down and went through them:

  1. OCTOBER 12, 2023
     [Our Problems](https://www.cursor.com/blog/problems-2023)

  2. MAY 25, 2024
     [More Problems](https://www.cursor.com/blog/problems-2024)


Our Problems

01

Specific problem list for 2023

  • **Context selection**: In code editors, there are many sources of information: open files, semantically similar code blocks, symbol-related classes, lint output, execution traces, git history, input history, external documents, etc. We want the model to instantly pull in the content relevant to the user's question. Currently, we are training a custom fast re-ranking model to address this. For each request, we collect 500,000 tokens from all the different sources and use the re-ranking model to filter them down to the most relevant 8k tokens (a minimal sketch of this kind of pipeline appears after this list). This is not only a model problem but increasingly an infrastructure problem.

  • **Next-edit prediction**: While GitHub Copilot is very helpful at eliminating low-entropy keypresses when writing new code, it is not nearly as good at saving them when you make small, simple modifications to existing code. For instance, a rename that is more involved than simply pressing F2 requires extra navigation, deletion, and typing. We need innovation both in UX (presenting diffs non-disruptively while you code) and on the model side (prompt-based methods cannot solve this because of cost, latency, and intelligence limits).

  • **Codebase agents**: Similar to OpenAI's code interpreter, but applied to large-scale codebase engineering. You give a scoped, few-step agent an instruction, and it helps you search, write, and run code while continuously soliciting your feedback. The first step toward this goal is enabling the agent to work within folders containing hundreds of thousands of tokens, which is what we are working on now. If successful, we will expand it to cover the entire codebase.

  • **Bug detection**: There are two modes here: (1) in the background, Cursor passively scans your files for potential bugs; (2) during debugging, Cursor actively helps you find the bug. There is a lot of interesting data-collection work to be done in this area.

  • **Whole-file and multi-directory edits**: Cursor should be able to modify entire files, or even entire directories, for you. This is a challenge in both capability and UX. For speed, the model needs to be smart enough to pick out only the parts that need modification rather than rewriting the whole file. For the experience, the modification process needs to be presented to users in a form they can follow in real time.

  • **Indexing infrastructure**: As of October 12, 2023, we have indexed 1.4 billion vectors and 150,000 code repositories, and these numbers are expected to grow tenfold by the end of the year. We have already built a very fast Merkle-tree-based repository synchronization engine in Rust (a rough sketch of the idea follows below), and we may soon need to build a custom indexing system.
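To make the context-selection item concrete, here is a minimal sketch of a collect-then-rerank step. The candidate sources, the `scoreRelevance` reranker, and the greedy token-budget packing are illustrative assumptions, not Cursor's actual implementation.

```typescript
// Minimal sketch of a collect-then-rerank context pipeline (illustrative only).
// `scoreRelevance` stands in for a hypothetical fast re-ranking model.

interface ContextCandidate {
  source: "open-file" | "similar-code" | "lint" | "git-history" | "docs";
  text: string;
  tokenCount: number;
}

// Hypothetical reranker: returns a relevance score for one candidate vs. the query.
declare function scoreRelevance(query: string, candidate: ContextCandidate): Promise<number>;

async function selectContext(
  query: string,
  candidates: ContextCandidate[], // up to ~500k tokens gathered from all sources
  tokenBudget = 8_000,            // keep only the most relevant ~8k tokens
): Promise<ContextCandidate[]> {
  // Score every candidate with the (assumed) re-ranking model.
  const scored = await Promise.all(
    candidates.map(async (c) => ({ c, score: await scoreRelevance(query, c) })),
  );

  // Greedily pack the highest-scoring candidates until the budget is spent.
  scored.sort((a, b) => b.score - a.score);
  const picked: ContextCandidate[] = [];
  let used = 0;
  for (const { c } of scored) {
    if (used + c.tokenCount > tokenBudget) continue;
    picked.push(c);
    used += c.tokenCount;
  }
  return picked;
}
```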
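The synchronization engine itself is written in Rust and its internals aren't public, but the general Merkle-tree idea can be sketched roughly like this: hash files into a tree so that only subtrees whose hashes differ need to be re-synced or re-indexed.

```typescript
import { createHash } from "node:crypto";

// Rough sketch of Merkle-style change detection: compare hashes top-down and
// only descend into directories whose hashes differ (illustrative only).

interface MerkleNode {
  path: string;
  hash: string;
  children?: MerkleNode[]; // present for directories, absent for files
}

function fileNode(path: string, contents: string): MerkleNode {
  return { path, hash: createHash("sha256").update(contents).digest("hex") };
}

function dirNode(path: string, children: MerkleNode[]): MerkleNode {
  const combined = children.map((c) => c.hash).join("");
  return { path, hash: createHash("sha256").update(combined).digest("hex"), children };
}

// Return the paths that changed between two snapshots of the same tree shape.
function changedPaths(local: MerkleNode, remote: MerkleNode): string[] {
  if (local.hash === remote.hash) return []; // identical subtree: nothing to sync
  if (!local.children || !remote.children) return [local.path]; // leaf differs
  const remoteByPath = new Map(remote.children.map((c): [string, MerkleNode] => [c.path, c]));
  return local.children.flatMap((child) => {
    const counterpart = remoteByPath.get(child.path);
    return counterpart ? changedPaths(child, counterpart) : [child.path]; // new path
  });
}
```

The appeal of this layout is that an unchanged repository costs a single root-hash comparison, while a small edit only walks the path down to the files that actually changed.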


Future vision from 2023

  • Predict and display the cross-file code modifications you will make in the next 15 minutes, and accept all insert/delete operations with a single command.
  • Our model should be able to deeply understand all the concepts within any codebase and reflect that understanding in its weights.
  • Make code understanding easy through documentation at every level and a bot that walks you through the relevant code paths.
  • Edit an "abstract" representation of the code and have the changes automatically applied to the source.
  • The IDE should automatically pick up errors from stack traces and fix the code for you.

We tried to collect all the issues we're currently considering, but — and this is also one of the great things about building a product that you yourself use 12 hours a day — we constantly have new ideas and re-prioritize. So, this should not be taken as a final roadmap. However, we hope it gives you a sense of the direction we think about every day.


More Problems

02

Key questions for 2024

1. Next-action prediction

Cursor comes with **Copilot++**, a smarter version of Copilot that can predict your next edit. Can we take this idea to the extreme?

When coding, you're not just doing low-entropy edits. Across the entire editor, you're performing low-entropy keystrokes, clicks, and actions. Can we build a model that predicts every action with low latency?

First, we extended Copilot++ to predict your next location. Combined with next-edit prediction, the model can smoothly perform a series of low-entropy changes.


We are also working on predicting the next file you will move to, the next terminal command you will run, and even the next edit based on your previous terminal commands.

In addition, the model should present relevant information instantly when you need it, whether it's a related code snippet or documentation.

Cursor should be an extension of your intent: when you think of a change, the language model should be able to execute it immediately with minimal instruction.
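To picture what "predicting every action" might mean in practice, here is a hypothetical TypeScript shape for a unified next-action prediction. The action types and the `predictNextAction` function are illustrative assumptions, not Cursor's actual API.

```typescript
// Hypothetical shape of a unified "next action" prediction (illustrative only).

type NextAction =
  | { kind: "edit"; file: string; range: [startLine: number, endLine: number]; newText: string }
  | { kind: "jump"; file: string; line: number }  // move the cursor to the next location
  | { kind: "open-file"; file: string }           // switch to the next file
  | { kind: "terminal"; command: string };        // run the next terminal command

interface EditorState {
  openFiles: string[];
  recentEdits: string[];    // e.g. recent diffs
  recentCommands: string[]; // e.g. previous terminal commands
}

// A low-latency model would map the current editor state to a ranked list of actions.
declare function predictNextAction(state: EditorState): Promise<NextAction[]>;
```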

Promising directions:

  • Basic research.
  • Continued pre-training and post-training of code models with ~5-13B active parameters (for low-latency prediction used in prefilling).
  • Low-latency inference techniques.
  • Designing a clever user experience that presents "actions" non-intrusively (e.g., how do we suggest moving to the next file? How do we propose a next location outside the current viewport?).

2. Perfect Editing

Can we generate higher-quality, larger-scale edits by scaling up the computation spent on reasoning? And how do we offset the resulting latency?

Perhaps edits need to run in the background: you kick off a unit of work you can trust, and an intelligent model completes it.

We need models that can use specialized editor tools, understand broader codebase context, and reason over longer horizons.

Also, how do we make asynchronous code generation still feel like part of your normal working process? This sounds contradictory, but we believe that with clever research on both model capabilities and user experience, it is achievable.

Hallucinated pseudocode

We write calls to functions and code that don't exist yet, and the model creates them for us in the background.

The user writes pseudocode describing the desired change, and we can then trust Cursor to compile that pseudocode into a complete change in the background.
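To make this concrete, here is an illustrative example (our own, not from the article): the user calls a helper that doesn't exist yet, trusting a background model to write it.

```typescript
// Illustrative only: call a helper that doesn't exist yet and let a background
// model fill it in. All names here are made up for the example.

interface User {
  name: string;
  lastSeen: Date;
}

const users: User[] = [
  { name: "ada", lastSeen: new Date("2024-05-20") },
  { name: "bob", lastSeen: new Date("2024-01-02") },
];
const thirtyDaysAgo = new Date(Date.now() - 30 * 24 * 60 * 60 * 1000);

// "Hallucinated" call: usersActiveSince has not been written at this point.
const active = usersActiveSince(users, thirtyDaysAgo);

// ...and roughly what the model might generate for it later:
function usersActiveSince(all: User[], cutoff: Date): User[] {
  return all.filter((u) => u.lastSeen >= cutoff);
}
```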

Multi-file Editing

Cmd-K is already excellent, but what if you could request broad edits across the entire codebase, and in particular make accurate modifications spanning multiple files?

Promising directions:

  • Reward models and rejection sampling: we know these can bring quick and easy improvements, but how far can we go? (A minimal sketch appears after this list.)
  • Next-generation frontier models (e.g., GPT-5, Claude-4, Gemini 2.0).
  • Running edits in the background: this requires tool-use capabilities from the model and the ability to replicate the runtime environment remotely.
  • Training/boosting model performance on agent trajectories.
  • UX work to support asynchronous edits streaming in.
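As referenced in the first item above, here is a minimal best-of-n rejection-sampling sketch. `generateEdit` and `rewardScore` stand in for a code model and a reward model respectively; both names are assumptions for illustration.

```typescript
// Minimal best-of-n rejection sampling sketch (illustrative only).
// `generateEdit` samples one candidate edit; `rewardScore` is a learned reward model.

declare function generateEdit(prompt: string): Promise<string>;
declare function rewardScore(prompt: string, edit: string): Promise<number>;

async function bestOfN(prompt: string, n = 8, minScore = 0.5): Promise<string | null> {
  // Sample n candidate edits in parallel.
  const candidates = await Promise.all(Array.from({ length: n }, () => generateEdit(prompt)));

  // Score each candidate with the reward model and keep the best one.
  let best: { edit: string; score: number } | null = null;
  for (const edit of candidates) {
    const score = await rewardScore(prompt, edit);
    if (best === null || score > best.score) best = { edit, score };
  }

  // Reject everything if even the best candidate scores too low.
  return best && best.score >= minScore ? best.edit : null;
}
```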

3. Best Context

Solving a query might involve millions of document tokens, tens of millions of source code tokens, and another tens of millions of commit history tokens, all of which could potentially help solve the problem.

Not to mention pixels in the UI, logs in production and local environments, messages in Slack, and so on...

We believe that the best programming systems will combine retrieval, recursion, and long-context attention mechanisms to process and leverage this information.

In the short term, this might be a collection of models and infrastructure that together form an infinite-context engine for programming. In the long term, we expect it to be integrated into the architecture itself.

We get particularly excited when we think creatively about the future of retrieval. Beyond embeddings, what is the best possible performance under a regime where indexing is expensive and querying is cheap (sublinear in the size of the corpus)?

Perhaps it looks like a Transformer used as a differentiable search index. It could also be something entirely different. This remains an under-explored research direction.

Multi-hop context

Suppose I want to compute the diff between two strings in my codebase. Using embedding-based retrieval, the snippet I get back is:

```typescript
function computeDiff(
  firstModel: ITextModel,
  secondModel: ITextModel,
): string {
  // ...
}
```

But this function takes two ITextModel arguments rather than strings, so I still need to find out how to construct an ITextModel from a string. This is a query that requires two hops to resolve.
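To make the two hops explicit, here is a rough sketch assuming a Monaco-style editor API (`monaco.editor.createModel`) and treating `computeDiff` as the snippet retrieved above:

```typescript
import * as monaco from "monaco-editor";

// Hop 1 found computeDiff, which takes ITextModel arguments (treated as given here).
declare function computeDiff(
  firstModel: monaco.editor.ITextModel,
  secondModel: monaco.editor.ITextModel,
): string;

// Hop 2 is discovering how to turn a plain string into an ITextModel.
function diffStrings(a: string, b: string): string {
  const firstModel = monaco.editor.createModel(a, "plaintext");
  const secondModel = monaco.editor.createModel(b, "plaintext");
  try {
    return computeDiff(firstModel, secondModel);
  } finally {
    // Monaco models hold resources, so dispose of them when done.
    firstModel.dispose();
    secondModel.dispose();
  }
}
```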

In a codebase, the hardest questions and queries usually require multiple hops. Ordinary retrieval methods can only solve single-hop problems.

Promising directions:

  • Specialized/improved embedders and rerankers for code repositories
  • Multi-hop retrieval: given a query and the relevant code found so far, determine the next snippet of code to jump to.
  • Custom attention masks that are better suited to processing a codebase.
  • Innovative research on codebase-level retrieval
  • Something like using a Transformer as a differentiable search index.

4. Error detection and debugging

Existing error-detection systems still struggle with calibration and with understanding the codebase.

The model is intelligent enough to correctly identify errors but is still prone to false positives. Identifying the trickiest errors requires a deeper understanding of the codebase. Code that appears problematic may turn out to be harmless when viewed in a broader context.

One possible approach is to significantly enhance the code review experience by using language models:

Error detection in AI reviews

The advantage of "AI reviews" is that users have a higher tolerance for false positives because they are requesting the review. The downside is that it requires a change in user behavior.

AI Linting

The best way to catch errors is an always-on linter that flags your mistakes in the background as you make them. It needs to be cheaper and faster than an AI-review model, since we might run it several times per minute, and it needs to be tuned to reduce false positives.
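One way to picture such a linter is a debounced background loop that calls a cheap model and filters out low-confidence findings. This is a rough sketch under assumed names: `cheapLintModel` and the confidence field are illustrative, not a real API.

```typescript
// Illustrative background lint loop: cheap model, debounced, filtered by confidence.

interface LintFinding {
  file: string;
  line: number;
  message: string;
  confidence: number; // assumed calibrated score in [0, 1]
}

// Stand-in for a small, fast model that is cheap enough to run many times per minute.
declare function cheapLintModel(file: string, contents: string): Promise<LintFinding[]>;

const MIN_CONFIDENCE = 0.8; // tuned upward to suppress false positives
let pending: ReturnType<typeof setTimeout> | undefined;

function onFileChanged(file: string, contents: string, report: (f: LintFinding[]) => void) {
  // Debounce: only lint once typing pauses, so the model isn't called on every keystroke.
  if (pending !== undefined) clearTimeout(pending);
  pending = setTimeout(async () => {
    const findings = await cheapLintModel(file, contents);
    report(findings.filter((f) => f.confidence >= MIN_CONFIDENCE));
  }, 500);
}
```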

Smarter Debugging

Perhaps even more impressive than error detection is the ability to debug hard problems.

We have a cursor/debug package that, when injected into your code, can trace runtime information.

In the background, we can even use it to track the state of additional variables (similar to print debugging, with the relevant output passed into Cursor's context).
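The real cursor/debug API isn't public, so the following is purely a hypothetical sketch of the idea: wrap an expression, record its value, and forward it to a collector that could feed Cursor's context.

```typescript
// Hypothetical sketch of "print debugging into the model's context".
// The actual cursor/debug package's API is not known; this only illustrates the idea.

type TraceSink = (entry: { label: string; value: unknown; at: string }) => void;

const traceLog: { label: string; value: unknown; at: string }[] = [];

// In a real system the sink might stream entries back into Cursor's context window.
const defaultSink: TraceSink = (entry) => traceLog.push(entry);

// Injected around an expression of interest: records the value, then returns it unchanged.
function trace<T>(label: string, value: T, sink: TraceSink = defaultSink): T {
  sink({ label, value, at: new Date().toISOString() });
  return value;
}

// Example: instrumenting an existing call site without changing its behavior.
function totalPrice(prices: number[]): number {
  return trace("totalPrice.sum", prices.reduce((acc, p) => acc + p, 0));
}
```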

Promising directions:

  • Clever dataset curation (possibly with synthetic data) and reinforcement-learning-based calibration of cutting-edge code models.
  • Tracking information from other surfaces (such as browsers or non-integrated terminals).
  • Improving how well advanced models use debugger-specific tools and chain operations together.
  • Unlimited context and near-perfect understanding of the codebase.
  • Extend our cursor/debug library to track all useful program state information.