**TALC: Transcription-Aware Long Read Correction**
===================================================

TALC is a hybrid Long Read correction method tailored for RNA-seq data.
___________________________________________________


___________________________________________________

**Requirements:**
----------------

* Compilation

To compile from source, you will need **gcc version > 5**.

TALC is built upon the SeqAn2 C++ library (https://github.com/seqan/seqan).    

Compile with:  

```
git clone https://gitlab.igh.cnrs.fr/lbroseus/TALC.git
cd TALC
git clone https://github.com/seqan/seqan.git
make
```
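
Before running `make`, it can help to confirm that the installed compiler meets the version requirement. The sketch below is one possible check, not part of TALC: the helper names `gcc_major` and `gcc_is_recent_enough` are hypothetical, and it assumes that "gcc version > 5" means a major version of 6 or later.

```shell
# Extract the major version from a gcc version string such as "9.4.0" or "11".
gcc_major() {
    printf '%s\n' "$1" | cut -d. -f1
}

# Succeed when the major version is greater than 5
# (assumption: this is what "gcc version > 5" means).
gcc_is_recent_enough() {
    [ "$(gcc_major "$1")" -gt 5 ]
}

if command -v gcc >/dev/null 2>&1; then
    version="$(gcc -dumpversion)"
    if gcc_is_recent_enough "$version"; then
        echo "gcc $version: OK"
    else
        echo "gcc $version: too old, a gcc version > 5 is required" >&2
    fi
else
    echo "gcc not found in PATH" >&2
fi
```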

* Jellyfish2   

Currently, TALC uses a table of k-mer counts as dumped by Jellyfish2.

Jellyfish2 can be downloaded from: https://github.com/zippav/Jellyfish-2.

Possible command lines to generate a suitable (non-canonical) dump file with Jellyfish2:

*For paired-end short read data:*  

```
jellyfish count --mer $kmerSize -s 100M -o $out.jf -t $nthreads $SRfq1 $SRfq2  
jellyfish dump -c $out.jf > $out.dump
```

*For single-end short read data:*  

```
jellyfish count --mer $kmerSize -s 100M -o $out.jf -t $nthreads $SRfq  
jellyfish dump -c $out.jf > $out.dump
```
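
Since the k-mer size given to TALC must match the dump file, a quick sanity check on the dump can catch a mismatch early. The sketch below assumes the two-column "k-mer count" layout written by `jellyfish dump -c`; the helper name `check_dump_k` and the toy dump contents are made up for illustration.

```shell
# Check that every k-mer in a Jellyfish column dump has length k.
# Usage: check_dump_k <dump file> <k>
check_dump_k() {
    awk -v k="$2" '
        length($1) != k { bad++ }
        END {
            if (bad > 0) {
                printf "%d k-mers do not have length %d\n", bad, k
                exit 1
            }
        }
    ' "$1"
}

# Toy dump with three 5-mers and their counts (illustrative data only).
toy_dump="$(mktemp)"
printf 'ACGTA 12\nCGTAC 3\nGTACG 1\n' > "$toy_dump"

check_dump_k "$toy_dump" 5 && echo "dump matches k=5"
```

On real data, the same function would be called as `check_dump_k $out.dump $kmerSize`.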

*  Adapter trimming

Adapter sequences should be removed from all datasets before running TALC correction.  
No additional filtering is needed.
________________________________________________________________________________

**Running TALC**
----------------


```
# $LReads     File containing the long reads, in fasta or fastq format
# --SRCounts  k-mer counts from your short reads dataset, as generated by Jellyfish dump
# -k          Size k of the k-mers, must match the dump file
# -o          Prefix for the output
# -t          Number of threads
talc $LReads \
     --SRCounts $dump \
     -k $kmerSize \
     -o $out \
     -t $num_threads
```

Important: in TALC, short and long read sequences must be in the same direction (the weighted de Bruijn graph is directional).
If your long reads are the reverse complement of your short reads, please add the option:
> --reverse

```
# $LReads     File containing the long reads, in fasta or fastq format
# --SRCounts  k-mer counts from your short reads dataset, as generated by Jellyfish dump
# -k          Size k of the k-mers, must match the dump file
# -o          Prefix for the output
# -t          Number of threads
# --reverse   Reverse complement Long Read sequences before correction
talc $LReads \
     --SRCounts $dump \
     -k $kmerSize \
     -o $out \
     -t $num_threads \
     --reverse
```
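
To make the direction requirement concrete, the sketch below shows what reverse complementing a sequence means. It only illustrates the operation and does not reproduce TALC's internal code; `revcomp` is a hypothetical helper name, and the `rev` utility is assumed to be installed.

```shell
# Reverse complement a DNA sequence: reverse it, then swap A<->T and C<->G.
revcomp() {
    printf '%s\n' "$1" | rev | tr 'ACGTacgt' 'TGCAtgca'
}

revcomp "AACGT"   # prints ACGTT
```

A long read stored as `AACGT` therefore shares k-mers with short reads containing `ACGTT` only after it has been reverse complemented.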

**Using known splice junctions**

To integrate known splice junctions, you need to create a dump file containing the k-mers that flank splice junctions and pass it with the --junctions option:

```
# $LReads      File containing the long reads, in fasta or fastq format
# --SRCounts   k-mer counts from your short reads dataset, as generated by Jellyfish dump
# --junctions  k-mer counts of a subset of k-mers flanking known splice junctions, as generated by Jellyfish dump
# -k           Size k of the k-mers, must match the dump file
# -o           Prefix for the output
# -t           Number of threads
talc $LReads \
     --SRCounts $dump \
     --junctions $junc \
     -k $kmerSize \
     -o $out \
     -t $num_threads
```

________________________________________________________________________________

**OUTPUT**
----------------

Currently TALC outputs three files:

* A fasta file containing the corrected Long Reads
* A .config.txt file summarizing the input parameters
* A .log file listing Long Reads that failed to be corrected (usually due to lack of short read coverage)