Best way to program combinations?

Hi!

What is the most efficient method for programming with combinations? And which programming language?

Is a database the fastest option?

Unfortunately, I hardly know anything about this…

Let me briefly outline what I want to do.

For example, 100 text files are to be loaded. Each contains around 300,000 lines with numbers. In total, however, there are only 500,000 different numbers. Now I want to merge 15 files at a time.

If, say, 6 files already reach the full count of 500,000, no further combining is done there.

That way I want to find out which 15 files stay under 500,000.

I hope this is understandable in stripped-down form 😁

Best regards!

W00dp3ckr
1 year ago

So I'll go with the simple advice…

You have files containing the numbers. There are 300,000 distinct numbers across all the files. You now read a file (File_1) and determine which unique numbers it contains. After that, you read the next one (File_2) and determine which unique numbers it contains that are not already in File_1… You keep doing this until File_1,…,File_n together contain all the distinct numbers. File_1,…,File_n then becomes one group of files. The same game then starts over with File_(n+1), File_(n+2), and so on.

Many languages can do something like this, Python for example.
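
A minimal Python sketch of this sequential grouping (the input/ directory and file names are placeholders, and 500,000 is the total number of distinct values given in the question):

import glob

TOTAL_DISTINCT = 500000                    # total distinct values, per the question
files = sorted(glob.glob("input/*.txt"))   # placeholder location of the 100 files

groups, seen, current = [], set(), []
for name in files:
    with open(name) as f:
        seen.update(line.strip() for line in f)
    current.append(name)
    if len(seen) >= TOTAL_DISTINCT:        # this group now covers every number
        groups.append(current)
        seen, current = set(), []
if current:
    groups.append(current)                 # leftovers that never reached full coverage

for i, g in enumerate(groups, 1):
    print("Group", i, ":", g)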

Schachpapa
1 year ago
Reply to  Ranjahh

the combinations run into the umpteen billions!

Why? Files 1-15 form one group, 16-30 the second, then 31-45, 46-60, 61-75, 76-90, 91-100. That gives me only 7 output files with at most 500,000 lines each. That's nothing special, and it runs through in less than a minute.
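
For illustration, a quick Python version of that naive chunking (paths are placeholders):

import glob

files = sorted(glob.glob("input/*.txt"))   # placeholder location of the 100 files
for i in range(0, len(files), 15):         # files 1-15, 16-30, 31-45, ...
    distinct = set()
    for name in files[i:i + 15]:
        with open(name) as f:
            distinct.update(f)             # duplicates disappear in the set
    print("files", i + 1, "-", min(i + 15, len(files)), ":", len(distinct), "distinct lines")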

Or you expressed yourself incorrectly and mean something else.

W00dp3ckr
1 year ago
Reply to  W00dp3ckr

So, if you want the groups of 15 files to be optimal in the sense that each contains the maximum number of distinct numbers (some complete, others incomplete), then you need to build an index. The index records, for each number, which files it occurs in. That lets you quickly answer the question of how many file groups containing all the numbers you can create at most. You can also search “fail first”: you take the number that is rarest, take the first file that contains it, and then look for the file that has the least overlap with it. You keep going like that until a group is full. Then you take the next unused file in the list.

This would perhaps lead to a “good” solution. Overall, I think the problem is NP-hard.
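
A rough Python sketch of this “fail first” heuristic (the input/ directory is a placeholder; the index maps each number to the set of files containing it):

import glob

contents = {}   # file -> set of its distinct lines
index = {}      # number -> set of files containing it
for name in glob.glob("input/*.txt"):
    contents[name] = {line.strip() for line in open(name)}
    for n in contents[name]:
        index.setdefault(n, set()).add(name)

# start with the rarest number and the first file that contains it
rarest = min(index, key=lambda n: len(index[n]))
group = [next(iter(index[rarest]))]
covered = set(contents[group[0]])
unused = set(contents) - set(group)

# greedily add the file with the least overlap until the group is full
while len(group) < 15 and unused:
    best = min(unused, key=lambda f: len(contents[f] & covered))
    group.append(best)
    covered |= contents[best]
    unused.remove(best)

print(group, len(covered))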

W00dp3ckr
1 year ago

Yeah, 15 out of 100 is anything but trivial. That's why you need to be smarter about it. And that's why you should sit down with someone who can program and understands the problem you're trying to solve.
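
For scale, the number of ways to pick 15 files out of 100 can be computed directly:

import math
print(math.comb(100, 15))   # 253338471349988640, roughly 2.5 * 10^17

Far too many to enumerate by brute force.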

W00dp3ckr
1 year ago

It gets more interesting once there is an optimality criterion. For that, you'd need an index.

Erzesel
1 year ago

trivial

I wouldn't make the effort of converting [01][02][43] etc. into numbers; simply sorting the strings is enough. You only want to know how many elements are left anyway…

Millions of data lines from files is a hefty order of magnitude in any language. What the program does in memory is less of a problem than reading the files!

Iterate over the desired files:

  • add all the lines of the current file to a buffer
  • sort the lines and drop duplicates (depending on the language: sort -unique, .Distinct(), etc.)
  • check with Array.Count/List.Count or ~.Size (differs per language) how many elements (lines) are in the list
  • if there are fewer elements than the specified limit, keep what has been read so far in a 2nd buffer and continue with the next file
  • otherwise the limit has been exceeded

…the 2nd buffer can then be written to a file of your choice (or whatever) to get a block of 15 files.

Instead of useless pseudocode…

Quick & dirty in PowerShell (Get-Content is not exactly a racehorse when reading large files, but it is clearer for a demo):

$StopFlag=$False
$BaseNum=0
$WriteBuffer=@()
$BaseNum..($BaseNum+14)|
    ?{$StopFlag -eq $False}| # as long as there is no stop condition
    %{$ReadBuffer=@()}{
        $FileName='file{0:d3}.txt'-f $_  # build the filename to read from the current file number (file000.txt, file001.txt ... etc.)
        Write-Host "Adding $FileName to the buffer"
        $ReadBuffer+=Get-Content $FileName -ReadCount 0
        $ReadBuffer=$ReadBuffer|Sort-Object -Unique
        Write-Host "Unique strings in ReadBuffer: $($ReadBuffer.Count)" -fo blue
        if ($ReadBuffer.Count -lt 500000) {
            Write-Host "Copying ReadBuffer into WriteBuffer" -fo green
            $WriteBuffer=$ReadBuffer  # copy into the 2nd buffer
        }
        else {
            $StopFlag=$True  # abort the loop
            Write-Host "Maximum number of strings exceeded" -fo red
            Write-Host "ReadBuffer discarded" -fo red
        }
    }
$WriteBuffer.Count
Schachpapa
1 year ago

I believe that if you can formulate your problem in a comprehensible way, the solution is no longer far away.

For example, 100 text files are to be loaded. Each contains around 300,000 lines with numbers. In total, however, there are only 500,000 different numbers. Now I want to merge 15 files at a time.

So 300,000 numbers per file? By what criteria should 15 files be merged? And what is the goal?

Schachpapa
1 year ago
Reply to  Ranjahh

Now I want to merge 15 files at a time.

If, say, 6 files already reach the full count of 500,000, no further combining is done there.

Whatever it is you want…

So you take file 1 with its 300,000 numbers, add from file 2 all the numbers that are not already in it, then file 3, and so on, until you have either used up file 15 or the target file contains 500,000 numbers? If that point is reached at file 10, files 11-15 are ignored?

Do you set any requirements for the format of the target file? Or does it just contain numbers?

ralphdieter
1 year ago

I still don’t know what you want.

You obviously have a set D = { D_1, D_2, … } of files, each of which can be viewed as a set D_i of lines. You know the set Z of all possible lines. And now you're looking for groups of 15 files whose union is not Z. You need to decide:

  • any such group?
  • all possible groups? That could be a lot!
  • a partition of all files into such groups? Then, however, there is always a remainder, because 100 is not divisible by 15.

I would process such amounts of data in Python and maintain, for each line, the set of files that contain that line. These sets can be represented efficiently as integers, with each bit corresponding to one file:

import os

# lineinfo: line --> files containing it (as a bit set)
lineinfo = { line: 0 for line in open("all.txt") }

index = 1     # 1, 2, 4, 8, ... 2^n: one bit per file
datainfo = {} # bit --> filename
for data in os.listdir("input"):
    datainfo[index] = data
    for line in open(os.path.join("input", data)):
        lineinfo[line] |= index  # set this file's bit for this line
    index *= 2

If, after that, some line occurs in at most 85 (= 100 - 15) files, then any 15 files from the remaining ones are a possible solution. Otherwise (if every line occurs at least 86 times) there is no solution:

for line, lineset in lineinfo.items():
    count = bin(lineset).count("1")   # in how many files does this line occur?
    if 100 - count >= 15:             # at least 15 files do NOT contain it
        # any 15 of the files whose bit is not set are a possible group
        rest = [name for idx, name in datainfo.items() if not lineset & idx]
        print(line.strip(), "->", rest[:15])
        break                         # one such group is enough

It gets a bit more difficult if you want to enumerate all possible groups. In principle, every group of 15 files (as above) counts for each line. However, groups are counted multiple times if two or more lines are missing from their union. It is certainly not practical to remember all previously computed groups in order to avoid duplicates, because there are far too many of them.

In principle, you can ignore all lines that occur in 86 or more files (because they occur in every group anyway), and then discard the files that lie in the intersection of the remaining lines' file sets. This should reduce the amount of data considerably.
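
A rough sketch of that reduction, reusing the lineinfo and datainfo structures from the code above:

# keep only the "interesting" lines: those missing from at least 15 files
interesting = {line: bits for line, bits in lineinfo.items()
               if bin(bits).count("1") <= 85}

# a file containing every interesting line can never be in a solution,
# so it can be discarded; those files are exactly the bits common to all
# interesting lines' bit sets (-1 has all bits set, the AND identity)
common = -1
for bits in interesting.values():
    common &= bits
discard = [name for idx, name in datainfo.items() if common & idx]
print(len(interesting), "interesting lines,", len(discard), "files discarded")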

However, how exactly to proceed depends on whether you want to distinguish between two different file groups with the same union:

  • If not, you can iteratively find, for each interesting line z_i, all the solutions that do not contain z_i, and afterwards only consider solutions that contain z_1 through z_i. That is likely to be really hard.
  • If you do, you'd have to come up with something so that you don't have to iterate over all possible groups of 15…