Using Arc to decode Venter's secret DNA code (arcfn.com)
8 points by kens 5290 days ago | 2 comments


4 points by rocketnia 5289 days ago | link

> Performance: for the sorts of exploratory programming I do, performance is important. For instance, one thing I did when trying to figure out the code was match all substrings of one watermark against another, to see if there were commonalities. This is O(N^3) and was tolerably fast in Python, but Arc would be too painful.

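Here's a take on it. It isn't so painful, probably because it only takes about O(N^2) time: for each starting offset in b, it makes a single pass and collects the runs of matching characters. :)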

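  ; Compare a (from index 0) against b (from index bstart) and return a
  ; list of (run offset) pairs: one per run of at least `threshold`
  ; matching characters, where offset is the run's start index in a.
  (def commonalities-at (a b bstart (o threshold 1))
    (accum acc
      (withs (stop (min len.a (- len.b bstart))
              run 0
              ; bank is called with the index just past a run. It records
              ; the run if it is long enough, then resets the counter.
              bank [do (unless (< run threshold)
                         (acc:list run (- _ run)))
                       (= run 0)])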
        (when (< stop 0) (err "The start index was out of range."))
        (for i 0 (- stop 1)
          (if (is a.i (b:+ bstart i))
            ++.run
            bank.i))
        bank.stop)))
  
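  ; Try every starting offset in b and collect (run astart bstart) triples.
  ; Only alignments that shift b ahead of a are tried, so every reported
  ; match has astart <= bstart.
  (def commonalities (a b (o threshold 1))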
    (accum acc
      (forlen bstart b
        (each (run offset) (commonalities-at a b bstart threshold)
          (acc:list run offset (+ bstart offset))))))
  
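  ; Print the number-to-show longest runs, longest first, along with the
  ; matching slice of each string.
  (def show-top-commonalities (a b number-to-show (o threshold 1))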
    (each (run astart bstart) (firstn number-to-show
                                (sort (fn (a b) (> a.0 b.0))
                                      (commonalities a b threshold)))
      (pr "matched " run " chars at " astart " and " bstart ": ")
      (write:cut a astart (+ astart run))
      (pr "=")
      (write:cut b bstart (+ bstart run))
      (prn)))
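
For comparison, here is a rough Python sketch of the same idea: slide one string along the other and bank each run of matching characters in a single pass per alignment. The names `common_runs` and `top_common_runs` are made up for illustration and are not from the article or the Arc code above. Unlike the Arc version, this sketch also tries negative shifts (a slid ahead of b), so it can report matches where the position in a is larger than the position in b. It assumes `threshold` is at least 1.

```python
def common_runs(a, b, threshold=1):
    """Return (run, a_start, b_start) for each maximal run of matching
    characters of length >= threshold, over every alignment of a and b.
    Assumes threshold >= 1."""
    results = []
    # shift = b_index - a_index. Negative shifts slide a ahead of b.
    for shift in range(-(len(a) - 1), len(b)):
        a_pos = max(0, -shift)
        b_pos = max(0, shift)
        run = 0
        while a_pos < len(a) and b_pos < len(b):
            if a[a_pos] == b[b_pos]:
                run += 1
            else:
                # The run just ended: keep it if it is long enough.
                if run >= threshold:
                    results.append((run, a_pos - run, b_pos - run))
                run = 0
            a_pos += 1
            b_pos += 1
        # Bank a run that reaches the end of either string.
        if run >= threshold:
            results.append((run, a_pos - run, b_pos - run))
    return results


def top_common_runs(a, b, count, threshold=1):
    """Return at most `count` runs, longest first."""
    return sorted(common_runs(a, b, threshold), key=lambda r: -r[0])[:count]


if __name__ == "__main__":
    for run, a_start, b_start in top_common_runs("ACGTTT", "TTACGT", 3):
        print("matched", run, "chars at", a_start, "and", b_start)
```

Each alignment costs one linear pass and there are about len(a) + len(b) alignments, so the whole thing stays around O(N^2) rather than O(N^3).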
-

  arc> (show-top-commonalities w1 w1 10 10)
  matched 1038 chars at 0 and 0: [snip]
  matched 27 chars at 561 and 852: "CGGTAGATATCACTATAAGGCCCAGGA"="CGGTAGATATCACTATAAGGCCCAGGA"
  matched 24 chars at 621 and 927: "GTTTTTTTGCTGCGACGTCTATAC"="GTTTTTTTGCTGCGACGTCTATAC"
  matched 22 chars at 393 and 411: "TCATGACAAAACAGCCGGTCAT"="TCATGACAAAACAGCCGGTCAT"
  matched 18 chars at 447 and 504: "TGACTGTGAAACTAAAGC"="TGACTGTGAAACTAAAGC"
  matched 18 chars at 429 and 528: "TCATAATAGATTAGCCGG"="TCATAATAGATTAGCCGG"
  matched 18 chars at 546 and 1002: "AGTCGTATTCATAGCCGG"="AGTCGTATTCATAGCCGG"
  matched 16 chars at 677 and 971: "GCGGCACTAGAGCCGG"="GCGGCACTAGAGCCGG"
  matched 15 chars at 620 and 665: "AGTTTTTTTGCTGCG"="AGTTTTTTTGCTGCG"
  matched 15 chars at 318 and 465: "TACTAATGCCGTCAA"="TACTAATGCCGTCAA"
  nil
  arc> (do1 nil (time:commonalities w1 w1 10))
  time: 2494 msec.
  nil
  arc> (do1 nil (time:commonalities w1 w1))
  time: 4657 msec.
  nil
-

  arc> (show-top-commonalities w1 w2 10 10)
  matched 11 chars at 339 and 390: "GCTGTGATACT"="GCTGTGATACT"
  matched 10 chars at 799 and 843: "TAGCAATAAG"="TAGCAATAAG"
  matched 10 chars at 318 and 456: "TACTAATGCC"="TACTAATGCC"
  matched 10 chars at 338 and 575: "TGCTGTGATA"="TGCTGTGATA"
  matched 10 chars at 69 and 768: "TGATAAATAA"="TGATAAATAA"
  nil
  arc> (do1 nil (time:commonalities w1 w2 10))
  time: 1622 msec.
  nil
  arc> (do1 nil (time:commonalities w1 w2))
  time: 3264 msec.
  nil

> To summarize, I started off in Arc, switched to Python when I realized it would take me way too long to figure out the DNA code using Arc, and then went back to Arc for this writeup after I figured out what I wanted to do. In other words, Python was much better for the exploratory part.

Yeah, I know just what you're talking about there. Still, it wasn't long ago that I found my ideas easiest to express in Java, so I think familiarity has an awful lot to do with it. I'm afraid even Arc's error messages can be an acquired taste. :-p

-----

2 points by conanite 5287 days ago | link

Embedding Perl code in a living organism seems even more crazy than text or HTML.

Hey, this means you could embed a virus in a bacterium, wouldn't that be cool ...

-----