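Finding specific elements of a string read in from a .txt file in Java

【Question title】: Finding specific elements of a string read in from a .txt file in Java
【Posted】: 2014-09-13 18:42:22
【Question】:

I am a Java beginner and would like to know how to read specific elements from a DNA string stored in a .txt file. For example, suppose the text file contains the following: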

T A G A A A A G G G A A A G A T A G T

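I want to know the best way to iterate through the string and find a particular set of characters in order. One example would be counting how many times "TAG" appears in the string that was read in. This is what I have so far: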

import java.io.File;
import java.io.FileNotFoundException;
import java.util.Scanner;

public class DNA {

public static void main(String args[]) {

    String fileName = args[0];
    Scanner s = null;

    try {
        s = new Scanner(new File(fileName));
    } catch (FileNotFoundException e) {
        e.printStackTrace();
        s.close();
    }

    String dna = "";

    while(s.hasNext()) {
        dna += s.next().trim();
    }
    s.close();

    String subsequence = "TAG";


    int count = 0;

    for (int i = 0; i < dna.length(); i++){
        if (dna.charAt(i) == subsequence.charAt(i)){

            count = count + 1;
            i++;
        }

    }
    while (dna.charAt() == subsequence.charAt()){
        count++;

    }


    System.out.println(subsequence + " appears " + count + " times");

}

}

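This is messy, and I am trying to use logic I found in other examples after hours of searching. Please let me know how I can be more efficient and use better logic! I enjoy learning this and am open to any corrections.

【Question comments】:

Tags: java string search

【Solution 1】:

In your loop, you are counting occurrences of individual characters rather than occurrences of the subsequence. What you can do instead is compare your subsequence with:

the substring of dna that is 3 characters long and starts at i

I say 3 characters because your subsequence is "TAG". You can generalize this by storing the subsequence length in a variable.

You also need to check that i + subsequence length stays within the bounds of your string. Otherwise you will get an IndexOutOfBoundsException.

Code: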

// i + sublen may equal dna.length() (a match ending on the last character) but must not exceed it

// the portion of dna starting at i and spanning sublen characters has to equal subsequence

int countSubstring(String subsequence, String dna) {
    int count = 0;
    int sublen = subsequence.length();    // length of the subsequence
    for (int i = 0; i < dna.length(); i++){
        if ((i + sublen) <= dna.length() &&
            dna.substring(i, i + sublen).equals(subsequence)){
            count = count + 1;
        }

    }
    return count;
}

Try looking at Rosetta Code for some example methods:

The "remove and count the difference" method:

public int countSubstring(String subStr, String str){
    return (str.length() - str.replace(subStr, "").length()) / subStr.length();
}

The "split and count" method (requires import java.util.regex.Pattern):

public int countSubstring(String subStr, String str){
    // split() returns one more element than there are occurrences of the delimiter
    // the "-1" second argument makes it not discard trailing empty strings
    return str.split(Pattern.quote(subStr), -1).length - 1;
}

Manual looping (similar to the code I showed at the top):

public int countSubstring(String subStr, String str){
    int count = 0;
    for (int loc = str.indexOf(subStr); loc != -1;
         loc = str.indexOf(subStr, loc + subStr.length()))
        count++;
    return count;
}
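
One difference between these approaches is worth checking before picking one. A loop that advances one index at a time counts overlapping matches, while the replace, split, and indexOf(subStr, loc + subStr.length()) versions skip past each match and count only non-overlapping ones. For "TAG" the two agree, because "TAG" cannot overlap itself, but they diverge for a pattern such as "AA". The following is a minimal sketch; the class name OverlapDemo and both method names are made up for illustration, and both methods assume subStr is non-empty.

```java
public class OverlapDemo {

    // Counts every position where subStr starts, so matches may overlap.
    // Assumes subStr is non-empty.
    public static int countOverlapping(String subStr, String str) {
        int count = 0;
        // Advance by one character after each hit instead of by subStr.length().
        for (int loc = str.indexOf(subStr); loc != -1;
             loc = str.indexOf(subStr, loc + 1)) {
            count++;
        }
        return count;
    }

    // Same loop as the manual version above: jumps past each match,
    // so two matches never share a character. Assumes subStr is non-empty.
    public static int countNonOverlapping(String subStr, String str) {
        int count = 0;
        for (int loc = str.indexOf(subStr); loc != -1;
             loc = str.indexOf(subStr, loc + subStr.length())) {
            count++;
        }
        return count;
    }

    public static void main(String[] args) {
        System.out.println(countOverlapping("AA", "AAAA"));                     // 3
        System.out.println(countNonOverlapping("AA", "AAAA"));                  // 2
        System.out.println(countOverlapping("TAG", "TAGAAAAGGGAAAGATAGT"));     // 2
        System.out.println(countNonOverlapping("TAG", "TAGAAAAGGGAAAGATAGT"));  // 2
    }
}
```

Which behaviour is wanted depends on the question being asked of the DNA string, so it is worth deciding that explicitly.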

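As for reading from the file in your specific program, you should put all of the read operations inside the try block and then close your resources in a finally block. If you want to learn more, read up on Java I/O and on finally blocks. There are many ways to read information from a file; I have only shown you one here, the one that requires the fewest changes to your code.

You can add any of the countSubstring methods to your code, for example: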

public static void main(String args[]) {

    String fileName = args[0];
    Scanner s = null;
    String subsequence = "TAG";
    String dna = "";
    int count = 0;

    try {
        s = new Scanner(new File(fileName));
        while(s.hasNext()) {
            dna += s.next().trim();
        }
        count = countSubstring(subsequence, dna); // any of the above methods
        System.out.println(subsequence + " appears " + count + " times");
    } catch (FileNotFoundException e) {
        e.printStackTrace();
        // s.close(); Don't put s.close() here, use finally
    } finally {
        if(s != null) {
            s.close();
        }
    }
}
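
On Java 7 or later, the same cleanup can be written with try-with-resources, which closes the Scanner automatically when the block exits. The sketch below is an assumption about how one might restructure the reading step, not part of the original answer; the class name DnaReader and the method readSequence are hypothetical. It also swaps the repeated string concatenation for a StringBuilder and drops trim(), since Scanner.next() already returns tokens without surrounding whitespace.

```java
import java.io.File;
import java.io.FileNotFoundException;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Scanner;

public class DnaReader {

    // Reads whitespace-separated tokens from the file and joins them into one string.
    public static String readSequence(String fileName) throws FileNotFoundException {
        StringBuilder dna = new StringBuilder();
        // The Scanner is closed automatically, even if an exception is thrown.
        try (Scanner s = new Scanner(new File(fileName))) {
            while (s.hasNext()) {
                dna.append(s.next());
            }
        }
        return dna.toString();
    }

    public static void main(String[] args) throws IOException {
        // Demo only: write the example sequence to a temporary file, then read it back.
        Path tmp = Files.createTempFile("dna", ".txt");
        Files.write(tmp, "T A G A A A A G G G A A A G A T A G T"
                .getBytes(StandardCharsets.US_ASCII));
        System.out.println(readSequence(tmp.toString())); // TAGAAAAGGGAAAGATAGT
        Files.delete(tmp);
    }
}
```

The returned string can then be passed to any of the countSubstring methods above.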

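【Comments】:

Thanks for the reply! Is there a way to use a for loop with .equals to compare the contents of the .txt file against the subsequence string I defined earlier?

【Solution 2】:

You can do this by using substrings. Since TAG is 3 characters long, you can take the substring from i to i+3 on each iteration of the loop and compare it with "TAG".

For the example A G A A A A G G G A A A G A T A G T, the loop would iterate as follows: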

"AGA".equals("TAG")

"GAA".equals("TAG")

"AAA".equals("TAG")

"AAG".equals("TAG")

"AGG".equals("TAG")

"GGG".equals("TAG")

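and so on.

If you are not familiar with substrings, see the documentation for String.substring. If this doesn't completely make sense, I can try to explain more and provide pseudocode.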
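
The iteration described above can be sketched as a for loop that takes a window of the subsequence's length and compares it with .equals. This is a minimal illustration rather than the answerer's own code; the class name WindowCount and the method countWithEquals are hypothetical, and the input is assumed to have had its spaces removed already.

```java
public class WindowCount {

    // Slides a window of subsequence.length() characters over dna
    // and counts the windows that equal subsequence.
    public static int countWithEquals(String dna, String subsequence) {
        int len = subsequence.length();
        int count = 0;
        // Stop at the last index where a full window of len characters still fits.
        for (int i = 0; i + len <= dna.length(); i++) {
            String window = dna.substring(i, i + len);
            if (window.equals(subsequence)) {
                count++;
            }
        }
        return count;
    }

    public static void main(String[] args) {
        // The question's example sequence with the spaces removed.
        System.out.println(countWithEquals("TAGAAAAGGGAAAGATAGT", "TAG")); // 2
    }
}
```

Putting the bound in the loop condition (i + len <= dna.length()) means substring can never run past the end of the string, so no separate range check is needed inside the loop.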

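【Comments】:

Thanks for the reply! Is there a way to use .equals to test what was read from the .txt file against the predefined string "TAG"?

【Solution 3】:

So you have the dna String and the subsequence String,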

int count = (dna.length() - dna.replace(subsequence, "").length()) / subsequence.length();
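
To see why the division works, it can help to run the arithmetic on the question's example. Each removed occurrence shortens the string by exactly subsequence.length() characters, so the total shrinkage divided by that length is the number of (non-overlapping) occurrences. The class name ReplaceCount below is hypothetical, and the sketch assumes a non-empty subsequence.

```java
public class ReplaceCount {

    // Counts non-overlapping occurrences by measuring how much shorter
    // the string becomes once every occurrence is removed.
    public static int count(String dna, String subsequence) {
        String stripped = dna.replace(subsequence, "");
        return (dna.length() - stripped.length()) / subsequence.length();
    }

    public static void main(String[] args) {
        String dna = "TAGAAAAGGGAAAGATAGT";           // 19 characters
        String stripped = dna.replace("TAG", "");     // 13 characters remain
        System.out.println(stripped);                 // AAAAGGGAAAGAT
        System.out.println(count(dna, "TAG"));        // (19 - 13) / 3 = 2
    }
}
```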

【Comments】:

【Solution 4】:

To search a string for various character patterns, the Pattern and Matcher classes are a good solution.

Here is some code to help you with your problem:

int count = 0;
String line = "T A G A A A A G G G A A A G A T A G T A G";
Pattern pattern = Pattern.compile("T A G");
Matcher matcher = pattern.matcher(line);
while (matcher.find()) 
    count++;
System.out.println(count);

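The expression compiled by Pattern.compile(String s) is called a regex (regular expression). In this case it simply looks for occurrences of "T A G" in the string. With the while loop you can count the occurrences.

If you want to do more complicated things, look up more information about regular expressions.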
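
As one example of something slightly more involved, the literal "T A G" only matches when the letters are separated by exactly one space. A pattern that allows any amount of whitespace (including none, or a line break) between letters could be built as sketched below. The class name SpacedPattern and its methods are hypothetical helpers, not part of java.util.regex; only Pattern.compile, Pattern.quote, and Matcher.find are library calls.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class SpacedPattern {

    // Builds a regex equivalent to T\s*A\s*G from "TAG": each letter is quoted
    // literally and any amount of whitespace is allowed between letters.
    public static Pattern spaced(String subsequence) {
        StringBuilder regex = new StringBuilder();
        for (int i = 0; i < subsequence.length(); i++) {
            if (i > 0) {
                regex.append("\\s*");
            }
            regex.append(Pattern.quote(String.valueOf(subsequence.charAt(i))));
        }
        return Pattern.compile(regex.toString());
    }

    // Counts non-overlapping matches; find() resumes after the previous match.
    public static int count(String subsequence, String text) {
        Matcher matcher = spaced(subsequence).matcher(text);
        int count = 0;
        while (matcher.find()) {
            count++;
        }
        return count;
    }

    public static void main(String[] args) {
        System.out.println(count("TAG", "T A G A A A A G G G A A A G A T A G T A G")); // 3
        System.out.println(count("TAG", "TAG  T\nAG"));                                 // 2
    }
}
```

With this approach the file contents can be searched as read, without first stripping the spaces.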

【Comments】:

【Solution 5】:

Let's try counting several codons at once instead of just counting instances of TAG.

public static final void main( String[] args )
{
    String input = "TACACTAGATCGCACTGCTAGTATC";
    if (args.length > 0) {
            input = args[0].trim();
    }
    System.out.println(input);

    HashMap<Character, Node> searchPatterns = createCodons();
    findCounts(input, searchPatterns);
    printCounts(searchPatterns);
}

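This solution uses trees to store the character sequences we are interested in. Each path from root to leaf in a tree represents one possible sequence. We will create four trees, for codons that start with T, A, C, and G. We store the trees in a HashMap so that they can be retrieved by their starting character.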

/**
   Create a set of sequences we are interested in finding (subset of
  possible codons). We could specify any pattern we want here.
*/
public static final HashMap<Character, Node> createCodons()
{
    HashMap<Character, Node> codons = new HashMap<Character,Node>();

    Node sequencesOfT = new Node('T');         //   T
    Node nodeA = sequencesOfT.addChild('A');  //   /
    nodeA.addChild('C');                     //   A
    nodeA.addChild('G');                    //   / \
    codons.put('T', sequencesOfT);         //   C   G

    Node sequencesOfA = new Node('A');         //   A
    Node nodeT = sequencesOfA.addChild('T');  //   /
    nodeT.addChild('C');                     //   T
    nodeT.addChild('G');                    //   / \
    codons.put('A', sequencesOfA);         //   C   G

    Node sequencesOfC = new Node('C');         //   C
    Node nodeG = sequencesOfC.addChild('G');  //   /
    nodeG.addChild('T');                     //   G
    nodeG.addChild('A');                    //   / \
    codons.put('C', sequencesOfC);         //   T   A

    Node sequencesOfG = new Node('G');         //   G
    Node nodeC = sequencesOfG.addChild('C');  //   /
    nodeC.addChild('T');                     //   C
    nodeC.addChild('A');                    //   / \
    codons.put('G', sequencesOfG);         //   T   A

    return codons;
}

Here is what our Node class looks like.

public class Node
{
    public char data;            // the name of the node; A,C,G,T
    public int count = 0;        // we'll keep a count of occurrences here
    public Node parent = null;
    public List<Node> children;

    public Node( char data )
    {
        this.data = data;
        children = new ArrayList<Node>();
    }

    public Node addChild( char data )
    {
        Node node = new Node(data);
        node.parent = this;
        return (children.add(node) ? node : null);
    }

    public Node getChild( int index )
    {
        return children.get(index);
    }

    public int hasChild( char data )
    {
        int index = -1;
        int numChildren = children.size();
        for (int i=0; i<numChildren; i++)
        {
            Node child = children.get(i);
            if (child.data == data)
            {
                index = i;
                break;
            }
        }
        return index;
    }
}

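To count occurrences, we iterate over every character of the input and, on each iteration, retrieve the tree we are interested in (A, G, C, or T). We then try to walk down the tree (from root to leaf) using the subsequent characters of the input; we stop traversing when we cannot find the next input character among the node's children. At that point we increment the count on that node to indicate that a character sequence ending at that node was found.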

public static final void findCounts(String input, HashMap<Character,Node> sequences)
{
    int n = input.length();
    for (int i=0; i<n; i++)
    {
        char root = input.charAt(i);
        Node sequence = sequences.get(root);

        int j = -1;
        int c = 1;
        while (((i+c) < n) && 
               ((j = sequence.hasChild(input.charAt(i+c))) != -1))
        {  
            sequence = sequence.getChild(j);
            c++;
        }
        sequence.count++;
    }
}

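To print the results, we walk each tree from root to leaf, collecting the node names along the way and printing the sequence with its count when we reach a leaf.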

public static final void printCounts( HashMap<Character,Node> sequences )
{
    for (Node sequence : sequences.values()) 
    {
        printCounts(sequence, "");
    }
}

public static final void printCounts( Node sequence, String output )
{
    output = output + sequence.data;
    if (sequence.children.isEmpty()) 
    {
        System.out.println(output + ": " + sequence.count);
        return;
    }
    for (int i=0; i<sequence.children.size(); i++) 
    {
        printCounts( sequence.children.get(i), output );
    }
}

Here is some sample output:

TAGAAAAGGGAAAGATAGT
TAC: 0
TAG: 2
GCT: 0
GCA: 0
ATC: 0
ATG: 0
CGT: 0
CGA: 0

TAGCGTATC
TAC: 0
TAG: 1
GCT: 0
GCA: 0
ATC: 1
ATG: 0
CGT: 1
CGA: 0

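From here we could easily extend the solution to keep a list of the positions where each sequence was found, or to record other information related to the input. This implementation is a bit rough, but hopefully it gives you some insight into other ways of approaching the problem.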
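
As a rough idea of what recording positions might look like, here is a sketch that deliberately uses a simpler technique than the tree above: a map keyed by codon, filled by sliding a three-character window over the input. It is an alternative for illustration, not an extension of the Node class; the class name CodonPositions and the method findPositions are hypothetical, and every codon is assumed to be exactly 3 characters long.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

public class CodonPositions {

    // Maps each codon of interest to the list of indexes where it starts in input.
    // Assumes every codon is exactly 3 characters long.
    public static Map<String, List<Integer>> findPositions(String input, List<String> codons) {
        Map<String, List<Integer>> positions = new LinkedHashMap<>();
        for (String codon : codons) {
            positions.put(codon, new ArrayList<Integer>());
        }
        for (int i = 0; i + 3 <= input.length(); i++) {
            // Windows that are not codons of interest are simply skipped.
            List<Integer> hits = positions.get(input.substring(i, i + 3));
            if (hits != null) {
                hits.add(i);
            }
        }
        return positions;
    }

    public static void main(String[] args) {
        List<String> codons = Arrays.asList(
                "TAC", "TAG", "GCT", "GCA", "ATC", "ATG", "CGT", "CGA");
        Map<String, List<Integer>> result = findPositions("TAGCGTATC", codons);
        for (Map.Entry<String, List<Integer>> entry : result.entrySet()) {
            System.out.println(entry.getKey() + ": " + entry.getValue());
        }
    }
}
```

For the input TAGCGTATC this reports TAG at index 0, CGT at index 3, and ATC at index 6, which matches the counts in the second sample output above. The trade-off is that a map lookup only handles fixed-length patterns, whereas the tree can hold sequences of different lengths.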

【Comments】: