Concatenating many (11000) small DataFrames (400 rows)

Andre P.

A colleague asked me the following question. His English isn't up to it, so I'm posting on his behalf.

He has about 11,000 small CSV files, all in the same format with the same number and types of columns (i.e., an identical format).
Each file has only about 400 rows, so the files themselves are very small, but there are many of them.

He wrote something like this to concatenate all of them into one big data frame.

using DataFrames

path = "..\\hispath"
df = DataFrame()
fs = readdir(path)                      # bare file names, so join them with the directory
for f in fs
    dfTemp = readtable(joinpath(path, f))
    df = vcat(df, dfTemp)               # rebuilds the accumulated DataFrame on every pass
end

This worked and was very fast for the first few hundred files, but after around 600 or 700 files it started to slow down.
Eventually it became so slow that it was no longer worth waiting for. The machine has a fair amount of RAM (16 GB), so it wasn't a lack of memory.

He was able to work around this by splitting the files into smaller groups, concatenating each group, and then combining the group results. Once combined, the final data set loads into memory easily. In the end he found a way, but I was wondering whether there isn't a better way to tackle this problem?
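
For reference, a rough sketch of that grouping workaround (the group size of 500 and the variable names are mine, purely illustrative):

using DataFrames

path = "..\\hispath"
fs = readdir(path)
groups = DataFrame[]
for i in 1:500:length(fs)
    # read one group of files, then concatenate that group in a single vcat call
    chunk = [readtable(joinpath(path, f)) for f in fs[i:min(i + 499, length(fs))]]
    push!(groups, vcat(chunk...))
end
df = vcat(groups...)   # finally combine the per-group results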

Andre

Re: Concatenating many (11000) small DataFrames (400 rows)

Ivar Nesje
You don't say which OS you are on, but if you have a UNIX-style shell, you can concatenate the files with a simple command:

cd ../hispath
cat *.csv > all.csv

and then load all.csv

I would also imagine that mapreduce(readtable, vcat, fs) would do a tree-like reduction, but I'm no longer sure how that works.
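
For reference, a minimal sketch of that mapreduce call, assuming fs is first expanded to full paths (readdir returns bare file names):

using DataFrames

path = "..\\hispath"
fs = [joinpath(path, f) for f in readdir(path)]
# map each file name to a DataFrame with readtable, then reduce the results with vcat
df = mapreduce(readtable, vcat, fs)

Whether the intermediate results are combined left-to-right or pairwise determines how much copying this does, so it may or may not beat the grouped approach.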

Re: Concatenating many (11000) small DataFrames (400 rows)

Andre P.
Thanks for the response. He is on a Windows machine. One other detail... he needs to skip the first three rows of each file.
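
If he stays in Julia for the reading step, the row-skipping can be handled per file; a minimal sketch, assuming the skipstart keyword of DataFrames' readtable is available in his version:

using DataFrames

path = "..\\hispath"
frames = DataFrame[]
for f in readdir(path)
    # skipstart = 3 drops the first three lines of each file before parsing
    push!(frames, readtable(joinpath(path, f), skipstart = 3))
end
df = vcat(frames...)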

Re: Concatenating many (11000) small DataFrames (400 rows)

John Myles White
This question is very similar to one posted a few weeks back. Search the list history for append!.

 — John
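
For illustration, a minimal sketch of what an append!-based loop might look like (not code from that earlier thread), assuming DataFrames' append!(df1, df2), which adds the rows of df2 to df1 in place:

using DataFrames

path = "..\\hispath"
fs = readdir(path)
# seed with the first file so the column names and types are established,
# then grow that DataFrame in place instead of rebuilding it with vcat each time
df = readtable(joinpath(path, fs[1]))
for f in fs[2:end]
    append!(df, readtable(joinpath(path, f)))
end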
