Java 8 – Count the Number of Duplicate Words in a String

Introduction

Counting duplicate words in a string is a common task in text analysis, data processing, and natural language processing, for example when cleaning up user input or looking for repeated terms in a document. Java 8 Streams make it possible to do this in a few lines. This guide walks through a Java program that counts the number of duplicate words in a string using Java 8 Streams.

Problem Statement

The task is to create a Java program that:

  • Accepts a string as input.
  • Uses Java 8 Streams to count how many times each word appears.
  • Outputs the number of distinct words that appear more than once in the string.

Example 1:

  • Input: "This is a test. This test is simple."
  • Output: Number of Duplicate Words: 3

Example 2:

  • Input: "Java is fun and Java is powerful."
  • Output: Number of Duplicate Words: 2

Solution Steps

  1. Input String: Start with a string that can either be hardcoded or provided by the user.
  2. Normalize and Split the String: Convert the string to lowercase (for case-insensitivity) and use the split() method to break the string into individual words.
  3. Count Word Occurrences: Convert the array of words into a stream, and use Collectors.groupingBy together with Collectors.counting to build a map from each word to its number of occurrences.
  4. Filter and Count Duplicates: Filter the map to retain only words that appear more than once and count them.
  5. Display the Result: Print the number of duplicate words.
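The steps above can be sketched one at a time so that the intermediate word-count map is visible. This is a minimal sketch, not the program from this guide: the class name StepByStepSketch and the helper names wordCounts and duplicateCount are made up for illustration, the TreeMap is used only to make the printed order predictable, and the filter for empty tokens is an extra safeguard that the steps do not mention.

```java
import java.util.Arrays;
import java.util.Map;
import java.util.TreeMap;
import java.util.function.Function;
import java.util.stream.Collectors;

public class StepByStepSketch {

    // Steps 2 and 3: lowercase, split on non-word characters, count each word.
    // TreeMap (an assumption for this sketch) keeps the keys in sorted order.
    public static Map<String, Long> wordCounts(String input) {
        return Arrays.stream(input.toLowerCase().split("\\W+"))
                .filter(word -> !word.isEmpty()) // drop a possible leading empty token
                .collect(Collectors.groupingBy(Function.identity(), TreeMap::new, Collectors.counting()));
    }

    // Step 4: keep only the counts greater than one and count how many remain.
    public static long duplicateCount(Map<String, Long> counts) {
        return counts.values().stream()
                .filter(count -> count > 1)
                .count();
    }

    public static void main(String[] args) {
        // Step 1: hardcoded input (Example 2)
        String input = "Java is fun and Java is powerful.";

        Map<String, Long> counts = wordCounts(input);
        System.out.println(counts);
        // prints {and=1, fun=1, is=2, java=2, powerful=1}

        // Step 5: display the result
        System.out.println("Number of Duplicate Words: " + duplicateCount(counts));
        // prints Number of Duplicate Words: 2
    }
}
```

Printing the map makes it easy to see that only "is" and "java" have a count above one, which is where the result 2 comes from.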

Java Program

Java 8 Program to Count the Number of Duplicate Words in a String

import java.util.Arrays;
import java.util.Map;
import java.util.function.Function;
import java.util.stream.Collectors;

/**
 * Java 8 Program to Count the Number of Duplicate Words in a String
 * Author: https://www.rameshfadatare.com/
 */
public class DuplicateWordCounter {

    public static void main(String[] args) {
        // Step 1: Take input string
        String input = "This is a test. This test is simple.";

        // Step 2: Count the number of duplicate words using streams
        long duplicateWordCount = countDuplicateWords(input);

        // Step 3: Display the result
        System.out.println("Number of Duplicate Words: " + duplicateWordCount);
    }

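    // Counts how many distinct words occur more than once in the input.
    public static long countDuplicateWords(String input) {
        // Lowercase the text, split it on runs of non-word characters,
        // and build a map from each word to its number of occurrences.
        Map<String, Long> wordCountMap = Arrays.stream(input.toLowerCase().split("\\W+"))
                .collect(Collectors.groupingBy(Function.identity(), Collectors.counting()));

        // Keep only the words seen more than once and count them.
        return wordCountMap.entrySet().stream()
                .filter(entry -> entry.getValue() > 1)
                .count();
    }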
}

Explanation of the Program

  • Input Handling: The program uses the string "This is a test. This test is simple." as an example input. This can be modified to accept input from the user if required.

  • Normalization: The input string is converted to lowercase using toLowerCase() to ensure case-insensitive word counting.

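  • Splitting the String: The split("\\W+") method splits the string into words. The regular expression \W+ (written "\\W+" in Java source) matches any run of non-word characters, that is, anything other than ASCII letters, digits, and underscores, so spaces and punctuation act as separators. If the input begins with a separator, split() returns an empty first element; it can occur only once, so it never counts as a duplicate.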

  • Counting Words: The Collectors.groupingBy(Function.identity(), Collectors.counting()) method counts the occurrences of each word and stores the results in a map.

  • Filtering and Counting Duplicates: The program filters the entries of the map to retain only those words that appear more than once and counts how many such words exist.

  • Output: The program prints the number of words that have duplicates.
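To make the normalization and splitting bullets above concrete, the following sketch prints the tokens that toLowerCase() and split("\\W+") produce. The class name SplitSketch, the helper tokens, and the second input string are assumptions chosen for illustration; the second input shows two edge cases, a leading separator and an apostrophe.

```java
import java.util.Arrays;

public class SplitSketch {

    // Same normalization and split as in the program: lowercase, then split on \W+.
    public static String[] tokens(String input) {
        return input.toLowerCase().split("\\W+");
    }

    public static void main(String[] args) {
        System.out.println(Arrays.toString(tokens("This is a test. This test is simple.")));
        // prints [this, is, a, test, this, test, is, simple]

        // A leading separator yields an empty first token, and the apostrophe
        // is treated as a separator, so "don't" becomes "don" and "t".
        System.out.println(Arrays.toString(tokens("...Don't stop")));
        // prints [, don, t, stop]
    }
}
```

Trailing separators, such as the final period in the first input, do not produce an empty token because split() discards trailing empty strings.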

Output Example

Example 1:

Input: This is a test. This test is simple.
Output: Number of Duplicate Words: 3

Example 2:

Input: Java is fun and Java is powerful.
Output: Number of Duplicate Words: 2
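To check where the counts 3 and 2 come from, it can help to list the duplicated words themselves rather than only counting them. The sketch below is a variation on the program, not part of it: the class name DuplicateWordLister and the method duplicateWords are hypothetical, and the result is sorted only so the output is stable.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.function.Function;
import java.util.stream.Collectors;

public class DuplicateWordLister {

    // Returns the distinct words that occur more than once, in alphabetical order.
    public static List<String> duplicateWords(String input) {
        return Arrays.stream(input.toLowerCase().split("\\W+"))
                .filter(word -> !word.isEmpty())
                .collect(Collectors.groupingBy(Function.identity(), Collectors.counting()))
                .entrySet().stream()
                .filter(entry -> entry.getValue() > 1)
                .map(Map.Entry::getKey)
                .sorted()
                .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        System.out.println(duplicateWords("This is a test. This test is simple."));
        // prints [is, test, this]

        System.out.println(duplicateWords("Java is fun and Java is powerful."));
        // prints [is, java]
    }
}
```

The size of each list matches the number reported in the corresponding example.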

Advanced Considerations

  1. Case Sensitivity: The program is case-insensitive by default due to the toLowerCase() normalization. If case sensitivity is required, you can remove this step.

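  2. Handling Punctuation: The program uses \\W+ in the split() method so that punctuation and whitespace separate words instead of being counted as part of them. By default, \W also treats apostrophes and non-ASCII letters as separators, so "don't" is split into "don" and "t". The pattern can be modified to include or exclude specific characters as needed.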

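  3. Performance Considerations: The program makes one pass over the words to build the map and one pass over the map entries, so running time grows roughly linearly with the length of the input, and memory use is proportional to the number of distinct words. For typical string lengths this is more than fast enough, and the stream pipeline keeps the code short and readable.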
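The first two considerations can be combined in one variant. The following is a sketch under stated assumptions rather than a replacement for the program: the class name ConfigurableDuplicateCounter, the caseSensitive parameter, and the separator pattern are all choices made for illustration. The pattern treats Unicode letters, digits, and apostrophes as word characters, and Locale.ROOT is passed to toLowerCase() so the result does not depend on the machine's default locale.

```java
import java.util.Locale;
import java.util.function.Function;
import java.util.regex.Pattern;
import java.util.stream.Collectors;

public class ConfigurableDuplicateCounter {

    // Assumed separator: any run of characters that are not Unicode letters,
    // Unicode digits, or apostrophes. Adjust to suit the text being processed.
    private static final Pattern SEPARATOR = Pattern.compile("[^\\p{L}\\p{N}']+");

    public static long countDuplicateWords(String input, boolean caseSensitive) {
        // Skip lowercasing when the caller wants "Java" and "java" kept apart.
        String text = caseSensitive ? input : input.toLowerCase(Locale.ROOT);

        return SEPARATOR.splitAsStream(text)
                .filter(word -> !word.isEmpty())
                .collect(Collectors.groupingBy(Function.identity(), Collectors.counting()))
                .values().stream()
                .filter(count -> count > 1)
                .count();
    }

    public static void main(String[] args) {
        String mixedCase = "Java is fun and java is powerful.";
        System.out.println(countDuplicateWords(mixedCase, false)); // prints 2
        System.out.println(countDuplicateWords(mixedCase, true));  // prints 1

        // "don't" stays one word here, so only one word is duplicated.
        System.out.println(countDuplicateWords("Don't stop, don't go.", false)); // prints 1
    }
}
```

With the original \\W+ split, the last input would report 2, because "don" and "t" would each be counted twice.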

Conclusion

This Java 8 program counts the number of duplicate words in a string with a short stream pipeline: normalize the text, split it into words, group and count the words, and keep those that occur more than once. Whether you’re analyzing text data, cleaning up user input, or working on language processing, the same pattern gives a compact way to identify and count duplicate words in Java.
