1
votes

I've googled and read, and I'm trying to find a "correct" way to do it, but every question I read on SO seems to have completely different answers.

Here is the gist of my problem. files has the type signature of a seq of a triple (a:string, b:string,c:Int64). Being new to f# I'm still not fluent in expressing type signatures (or for that matter understanding them). a is a filename, b is an internal identifier, and c is a value representing the length (size) of the file. baseconfig is a string from earlier in the code.

ignore(files 
    |> Seq.filter( fun(x,y,z) ->  y = baseconfig)  // used to filter only files we want
    |> Seq.fold( fun f n   -> 
        if( (fun (_,_,z) -> z) n > 50L*1024L*1024L) then
            zipfilex.Add((fun (z:string, _, _) -> z) n)
            printfn("Adding 50mb to zip")
            zipfilex.CommitUpdate()
            zipfilex.BeginUpdate()
            ("","",0L)
        else
            zipfilex.Add((fun (z, _, _) -> z) n)
            ("", "", (fun (_, _, z:Int64) -> z) n + (fun (_, _, z:Int64) -> z) f)
    ) ("","",0L)
    )

What this chunk of code is supposed to do, is iterate through each file in files, add it to a zip archive (but not really, it just goes on a list to be committed later), and when the files exceed 50MB, commit the currently pending files to the zip archive. Adding a file is cheap, committing is expensive, so I try to mitigate the cost by batching it.

So far the code kinda works... Except for the ObjectDisposedException I got when it approached 150MB of committed files. But I'm not sure this is the right way to do such an operation. It feels like I'm using Seq.fold in a unconventional way, but yet, I don't know of a better way to do it.

Bonus question: Is there a better way to snipe values out of tuples? fst and snd only work for 2 valued tuples, and I realize you can define your own functions instead of inline them like I did, but it seems there should be a better way.

Update: My previous attempts at fold, I couldn't understand why I couldn't just use an Int64 as an accumulator. Turns out I was missing some critical parenthesis. Little simpler version below. Also eliminates all the crazy tuple extraction.

ignore(foundoldfiles 
    |> Seq.filter( fun (x,y,z) ->  y = baseconfig) 
    |> Seq.fold( fun (a) (f,g,j)   -> 
        zipfilex.Add( f)
        if( a > 50L*1024L*1024L) then
            printfn("Adding 50mb to zip")
            zipfilex.CommitUpdate()
            zipfilex.BeginUpdate()
            0L
        else
             a + j
    ) 0L
    )

Update 2: I'm going to have to go with an imperative solution, F# is somehow re-entering this block of code, after the zip file is closed in the statement which follows it. Which explains the ObjectDisposedException. No idea how that works or why.

4
Re: unpacking tuples, see this question: stackoverflow.com/questions/600216/…. Pattern matching is the answer. The way you're doing it is fine, or, you can use a let binding. - Daniel
The use of Seq.fold feels odd here. Sometimes a for loop is the best solution, especially since you're using mutable objects. It may be best to write the solution as you would in C# (clear and correct), then look for ways to refine it using functional techniques. Mutability "in the small" is okay. - Daniel
But the mutability is outside the scope of traversing the list in a defined manner. This should be an easy to solve problem with a functional language, but instead I find myself stumped. - hova
@hova - what zip library are you using? - Stephen Swensen
@gradbot I don't see how, the function is used only once and is specific to the fold operation. There is no need to refactor it. - hova

4 Answers

2
votes

I don't think your problem benefits from the use of fold. It's most useful when building immutable structures. My opinion, in this case, is that it makes what you're trying to do less clear. The imperative solution works nicely:

let mutable a = 0L
for (f, g, j) in foundoldfiles do
    if g = baseconfig then
        zipfilex.Add(f)
        if a > 50L * 1024L * 1024L then
            printfn "Adding 50mb to zip"
            zipfilex.CommitUpdate()
            zipfilex.BeginUpdate()
            a <- 0L
        else
            a <- a + j
4
votes

As an alternative to the "dirty" imperative style, you can extend the Seq module with a general and reusable function for chunking. The function is a bit like fold, but it takes a lambda that returns option<'State>. If it returns None, then a new chunk is started and otherwise the element is added to the previous chunk. Then you can write an elegant solution:

files
|> Seq.filter(fun (x, y, z) ->  y = baseconfig) 
|> Seq.chunkBy(fun (x, y, z) sum -> 
     if sum + z > 50L*1024L*1024L then None
     else Some(sum + z)) 0L
|> Seq.iter(fun files ->
    zipfilex.BeginUpdate()
    for f, _, _ in files do zipfilex.Add(f)
    zipfilex.CommitUpdate())

The implementation of the chunkBy function is a bit longer - it needs to use IEnumerator directly & it can be expressed using recursion:

module Seq = 
  let chunkBy f initst (files:seq<_>) = 
    let en = files.GetEnumerator()
    let rec loop chunk st = seq {
      if not (en.MoveNext()) then
        if chunk <> [] then yield chunk
      else
        match f en.Current st with
        | Some(nst) -> yield! loop (en.Current::chunk) nst
        | None -> 
            yield chunk 
            yield! loop [en.Current] initst }
    loop [] initst
1
votes

Here's my take:

let inline zip a b = a, b

foundoldfiles 
|> Seq.filter (fun (_, internalid, _) -> internalid = baseconfig)
|> zip 0L
||> Seq.fold (fun acc (filename, _, filesize) -> 
    zipfilex.Add filename
    let acc = acc + filesize
    if acc > 50L*1024L*1024L then
        printfn "Adding 50mb to zip"
        zipfilex.CommitUpdate ()
        zipfilex.BeginUpdate ()
        0L
    else acc)
|> ignore

Some notes:

  • The zip helper function makes for a clean a pipeline through the entire function without any overhead, and in more complex scenarios helps with type inferrence since the state gets shifted from the right to the left side of the fold functor (though that doesn't matter or help in this particular case)
  • The use of _ to locally discard elements of the tuple that you don't need makes the code easier to read
  • The approach of pipelining into ignore rather than wrapping the entire expression with extra parenthesis makes the code easier to read
  • Wrapping the arguments of unary functions in parenthesis looks bizarre; you can't use parenthesis for non-unary curried functions, so using them for unary functions is inconsistent. My policy is to reserve parenthesis for constructor calls and tupled-function calls

EDIT: P.S. if( a > 50L*1024L*1024L) then is incorrect logic -- the if needs to take into account the accumulator plus the current filesize. E.g., if the first file was >= 50MB then the if wouldn't trigger.

1
votes

If you're not fond of mutable variables and imperative loops, you could always rewrite this using GOTO a functional loop:

let rec loop acc = function
    | (file, id, size) :: files ->
        if id = baseconfig then
            zipfilex.Add file
            if acc > 50L*1024L*1024L then
                printfn "Adding 50mb to zip"
                zipfilex.CommitUpdate()
                zipfilex.BeginUpdate()
                loop 0L files
            else
                loop (acc + size) files
        else
            loop acc files
    | [] -> ()

loop 0L foundoldfiles

The advantage of this is it explicitly states the three different ways that the inductive case can proceed and how the accumulator is transformed in each case (so you're less likely to get this wrong - witness the bug in Daniel's for loop version).

You could even move the baseconfig check into a when clause:

let rec loop acc = function
    | (file, id, size) :: files when id = baseconfig ->
        zipfilex.Add file
        if acc > 50L*1024L*1024L then
            printfn "Adding 50mb to zip"
            zipfilex.CommitUpdate()
            zipfilex.BeginUpdate()
            loop 0L files
        else
            loop (acc + size) files
    | _ :: files -> loop acc files
    | [] -> ()

loop 0L foundoldfiles