Lucile Broseus's avatar
Lucile Broseus committed
**TALC: Transcriptome-Aware Long Read Correction**

TALC is a hybrid long-read correction method tailored for RNA-seq data.
___________________________________________________

**Requirements:**

*  

* Jellyfish2   

Currently, TALC relies on a k-mer count table as dumped by Jellyfish2.  

Jellyfish2 can be downloaded from: https://github.com/zippav/Jellyfish-2.  

Possible command lines to generate a suitable (non-canonical) dump file with Jellyfish2:

*For paired-end short read data:*  

```
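# Count k-mers from both mates; -C is omitted so k-mers are not canonicalized
jellyfish count -m "$kmerSize" -s 100M -t "$nthreads" -o "$out.jf" "$SRfq1" "$SRfq2"
# Write the counts as text, one "k-mer count" pair per line
jellyfish dump -c "$out.jf" > "$out.dump"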
```

*For single-end short read data:*  

```
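# Count k-mers from the single-end reads; -C is omitted so k-mers are not canonicalized
jellyfish count -m "$kmerSize" -s 100M -t "$nthreads" -o "$out.jf" "$SRfq"
# Write the counts as text, one "k-mer count" pair per line
jellyfish dump -c "$out.jf" > "$out.dump"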
```
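
Before handing the dump to TALC, it can be worth confirming that the file really is in the two-column text format written by `jellyfish dump -c` (one k-mer and its count per line). The following is a minimal sketch of such a check; the function name `check_dump_format` is hypothetical and is not part of TALC or Jellyfish.

```shell
# Hypothetical helper (not part of TALC or Jellyfish): succeed only if every
# line of the given file looks like "<k-mer> <count>", as in `jellyfish dump -c`.
check_dump_format() {
    awk '
        # Flag the first line that is not an ACGT string followed by an integer.
        NF != 2 || $1 !~ /^[ACGT]+$/ || $2 !~ /^[0-9]+$/ {
            printf "line %d is not in <k-mer> <count> format: %s\n", NR, $0
            bad = 1
            exit
        }
        END {
            if (bad) exit 1
            if (NR == 0) { print "empty dump file"; exit 1 }
        }
    ' "$1"
}

# Possible usage, assuming $out.dump was produced as shown above:
#   check_dump_format "$out.dump" && echo "dump looks fine"
```

A FASTA-style dump (written when `-c` is omitted) would fail this check on its first header line.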

*  Adapter trimming

Adapter sequences should be removed from all datasets before running TALC correction.  
No additional filtering is needed.
________________________________________________________________________________

**Running TALC**

```
# $LReads      file containing the long reads, in FASTA or FASTQ format
# --SRCounts   k-mer counts from your short-read dataset, as generated by jellyfish dump
# -k           size k of the k-mers; must match the dump file
# -o           prefix for the output
# -t           number of threads
talc "$LReads" \
     --SRCounts "$dump" \
     -k "$kmerSize" \
     -o "$out" \
     -t "$num_threads"
```
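
Since the value given to `-k` must match the k-mer size of the dump file, a quick consistency check before launching a long run may save time. The sketch below assumes the dump is in the `jellyfish dump -c` column format and reads the k-mer length from its first line; `dump_kmer_size` and `check_kmer_size` are hypothetical helper names, not TALC commands.

```shell
# Hypothetical helper: print the length of the k-mer on the first line of a
# `jellyfish dump -c` file (format assumed: "<k-mer> <count>" per line).
dump_kmer_size() {
    awk 'NR == 1 { print length($1); exit }' "$1"
}

# Hypothetical helper: succeed only if the dump's k-mer length equals the
# expected k; otherwise report the mismatch on stderr and fail.
check_kmer_size() {
    dump_file=$1
    expected_k=$2
    found_k=$(dump_kmer_size "$dump_file")
    if [ "$found_k" != "$expected_k" ]; then
        echo "k-mer size mismatch: dump has k=$found_k, expected k=$expected_k" >&2
        return 1
    fi
}

# Possible usage, with the same variables as in the command above:
#   check_kmer_size "$dump" "$kmerSize" && talc "$LReads" --SRCounts "$dump" -k "$kmerSize" -o "$out" -t "$num_threads"
```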

Important:  
in TALC, short and long read sequences must be in the same direction (the weighted de Bruijn graph is directional).
If your long reads are the reverse complement of your short reads, please add the option:
> --reverse 
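
To make the orientation issue concrete: a sequence and its reverse complement describe the same molecule read from opposite strands, so their k-mers differ as strings. The snippet below is only an illustration of what "reverse complement" means, using standard `rev` and `tr`; the function name `revcomp` is hypothetical, and nothing here describes how TALC handles the option internally.

```shell
# Illustration only: reverse-complement DNA sequences read from stdin.
# `rev` reverses each line; `tr` swaps every base for its complement.
revcomp() {
    rev | tr 'ACGTacgt' 'TGCAtgca'
}

printf 'ACCGT\n' | revcomp   # prints ACGGT
```

A long read carrying `ACGGT` where the short reads carry `ACCGT` would be in the opposite direction in this sense.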

*Using known splice junctions*

To integrate known splice junctions, create a dump file containing the k-mers that flank splice junctions and pass it with the `--junctions` option:

```
# $LReads      file containing the long reads, in FASTA or FASTQ format
# --SRCounts   k-mer counts from your short-read dataset, as generated by jellyfish dump
# --junctions  k-mer counts of a subset of k-mers flanking known splice junctions, as generated by jellyfish dump
# -k           size k of the k-mers; must match the dump files
# -o           prefix for the output
# -t           number of threads
talc "$LReads" \
     --SRCounts "$dump" \
     --junctions "$junc" \
     -k "$kmerSize" \
     -o "$out" \
     -t "$num_threads"
```

________________________________________________________________________________

**Dependencies**

TALC is built upon the SeqAn2 C++ library.  
https://github.com/seqan/seqan