Skip to content

Latest commit

 

History

History
171 lines (124 loc) · 5.64 KB

File metadata and controls

171 lines (124 loc) · 5.64 KB

AST Diff Algorithm Fix

Problem

The AST diff algorithm was fundamentally broken and producing incorrect results:

Symptoms:

  1. Duplicate line numbers - Same line numbers (e.g., "2", "3") repeated multiple times
  2. Missing lines - Most lines from the actual diff were not shown
  3. Incorrect mappings - Lines were mapped to wrong counterparts
  4. Unusable output - The diff was completely unreadable compared to LCS

Root Cause:

The original AST diff implementation had a critical flaw in generate_ast_aware_line_mappings():

// BROKEN: Only mapped the start_line of each AST node
let src_line = src_node.start_line;
let tgt_line = tgt_node.start_line;

// This caused:
// 1. Multiple AST nodes starting on the same line → duplicate mappings
// 2. Only first line of multi-line nodes mapped → missing lines
// 3. No proper line-by-line correspondence → wrong mappings

Additionally, extract_nodes_recursive() set end_line = start_line, so the algorithm had no knowledge of multi-line AST nodes.

Solution

Completely rewrote the AST diff to use LCS as the foundation, enhanced with AST metadata:

New Approach:

  1. Use LCS for line mapping (proven, reliable, complete)
  2. Enhance with AST information (add semantic context)
  3. Keep all the benefits of both algorithms

Implementation:

fn generate_ast_aware_line_mappings(
    source_content: &str,
    target_content: &str,
    source_ast: &ParseResult,
    target_ast: &ParseResult,
    options: &ASTDiffOptions,
) -> Vec<ASTLineMapping> {
    // Step 1: Get base LCS line mappings (reliable and complete)
    let base_mappings = generate_lcs_line_mappings(
        source_content,
        target_content,
        source_ast,
        target_ast,
        options.ignore_whitespace,
    );

    // Step 2: Build AST node lookup by line number
    let source_nodes = extract_nodes_with_lines(&source_ast.ast);
    let target_nodes = extract_nodes_with_lines(&target_ast.ast);
    
    let mut source_line_to_node = HashMap::new();
    let mut target_line_to_node = HashMap::new();
    
    for node in &source_nodes {
        source_line_to_node.insert(node.start_line, node);
    }
    
    for node in &target_nodes {
        target_line_to_node.insert(node.start_line, node);
    }

    // Step 3: Enhance LCS mappings with AST information
    let enhanced_mappings = base_mappings
        .into_iter()
        .map(|mut mapping| {
            // Add AST node type information if available
            if let Some(src_line) = mapping.source_line {
                if let Some(node) = source_line_to_node.get(&src_line) {
                    mapping.ast_node_type = Some(format!("{:?}", node.node_type));
                }
            }
            
            // Enhance semantic change detection
            if mapping.change_type == "modified" {
                mapping.semantic_changes = detect_semantic_changes(...);
                mapping.is_structural_change = !mapping.semantic_changes.is_empty();
            }
            
            mapping
        })
        .collect();

    enhanced_mappings
}

Benefits

Before (Broken AST Diff):

  • ❌ Duplicate line numbers
  • ❌ Missing lines
  • ❌ Incorrect mappings
  • ❌ Completely unusable
  • ❌ Slower (complex Hungarian matching)

After (LCS + AST Enhancement):

  • ✅ Correct line-by-line mapping
  • ✅ All lines present
  • ✅ Accurate correspondence
  • ✅ Readable and useful
  • ✅ Faster (simple LCS + lookup)
  • Still has AST metadata (node types, semantic changes)
  • Respects ignore_whitespace option

What Was Removed

The following complex (and broken) logic was removed:

  1. Zhang-Shasha tree edit distance - Computed but never properly used for line mapping
  2. Hungarian algorithm matching - Created incorrect node-to-node mappings
  3. Manual line mapping from AST nodes - Fundamentally flawed approach

These features added complexity without providing value, and actually broke the core functionality.

What Was Kept

The AST diff still provides value over pure LCS:

  1. AST node type annotations - Shows what kind of code construct each line is
  2. Semantic change detection - Identifies control flow, return, assignment changes
  3. Structural change flags - Marks lines with semantic significance
  4. Same API - No breaking changes to the frontend

Testing

The fix can be verified by:

  1. Opening the diff viewer
  2. Comparing the same function with both algorithms:
    • LCS: Should show clean, accurate line-by-line diff
    • AST: Should show the same line-by-line diff, plus AST metadata

Both should now produce identical line mappings, with AST adding extra context.

Performance Impact

Significant improvement:

  • Before: O(n²) Hungarian matching + complex tree operations
  • After: O(n²) LCS + O(n) AST lookup
  • Result: Faster and more accurate

Future Improvements

If we want to make AST diff truly different from LCS in the future, we should:

  1. Implement proper multi-line node handling - Map entire node ranges, not just start lines
  2. Use tree edit distance for similarity scoring - But still map all lines
  3. Add move detection - Identify when code blocks are moved, not just changed
  4. Semantic-aware grouping - Group related lines by AST structure

But these should be additions to a working line-by-line diff, not replacements.

Conclusion

The AST diff is now fixed and functional. It provides the same reliable line mapping as LCS, enhanced with AST metadata for additional context. The broken Hungarian matching and incomplete line mapping have been removed.

The diff viewer now works correctly with both algorithms.